Abstract

Oracle inequalities and variable selection properties for the Lasso in linear models have been established under a variety of different assumptions on the design matrix. We show in this paper how the different conditions and concepts relate to each other. The restricted eigenvalue condition (Bickel et al., 2009) or the slightly weaker compatibility condition (van de Geer, 2007) are sufficient for oracle results. We argue that both these conditions allow for a fairly general class of design matrices. Hence, optimality of the Lasso for prediction and estimation holds for more general situations than what it appears from coherence (Bunea et al., 2007b,c) or restricted isometry (Candes and Tao, 2005) assumptions.

1 Introduction

In this paper we revisit some sufficient conditions for oracle inequalities for the Lasso in regression and examine their relations. Such oracle results have been derived, among others, by Bunea et al. (2007c), van de Geer (2008), Zhang and Huang (2008), Meinshausen and Yu (2009), Bickel et al. (2009), and for the related Dantzig selector by Candes and Tao (2007) and Koltchinskii (2009b). Furthermore, variable selection properties of the Lasso have been studied by Meinshausen and Bühlmann (2006), Zhao and Yu (2006), Lounici (2008), Zhang (2009) and Wainwright (2009). Our main aim is to present an overview of the relations (of which some are known and some are new), and to emphasize that sufficient conditions for oracle inequalities hold in fairly general situations.

The Lasso, which we at first only study in a noiseless situation, is defined as follows. Let $\mathcal{X}$ be some measurable space, $Q$ be a probability measure on $\mathcal{X}$, and $\|\cdot\|$ be the $L_2(Q)$ norm. Consider a fixed dictionary of functions $\{\psi_j\}_{j=1}^p \subset L_2(Q)$, and linear functions

$$f_\beta(\cdot) := \sum_{j=1}^p \beta_j \psi_j(\cdot), \quad \beta \in \mathbb{R}^p.$$

Consider moreover a fixed target

$$f^0 := f_{\beta^0} = \sum_{j=1}^p \beta_j^0 \psi_j.$$

We let $S := \{j : \beta_j^0 \neq 0\}$ be its active set, and $s := |S|$ be the sparsity index of $f^0$.

For some fixed $\lambda > 0$, the Lasso for the noiseless problem is

$$\beta^* := \arg\min_\beta \left\{ \|f_\beta - f^0\|^2 + \lambda \|\beta\|_1 \right\}, \qquad (1)$$

where $\|\cdot\|_1$ is the $\ell_1$-norm. We write $f^* := f_{\beta^*}$ and let $S_*$ be the active set of the Lasso.
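As an illustration (not part of the original development), the noiseless Lasso can be computed numerically once the dictionary is summarized by its Gram matrix $\Sigma$ of inner products $(\psi_j, \psi_k)$, because $\|f_\beta - f^0\|^2 = (\beta - \beta^0)^T \Sigma (\beta - \beta^0)$. The sketch below uses cyclic coordinate descent; the function names, the fixed number of sweeps, and the assumption that all diagonal entries of $\Sigma$ are strictly positive are choices made here, not taken from the paper.

```python
import numpy as np

def soft_threshold(x, t):
    """Soft-thresholding operator sign(x) * max(|x| - t, 0)."""
    return np.sign(x) * max(abs(x) - t, 0.0)

def noiseless_lasso(Sigma, beta0, lam, n_sweeps=500):
    """Cyclic coordinate descent for
    (b - beta0)' Sigma (b - beta0) + lam * ||b||_1.
    Assumes Sigma is symmetric with strictly positive diagonal."""
    Sigma = np.asarray(Sigma, dtype=float)
    beta0 = np.asarray(beta0, dtype=float)
    target = Sigma @ beta0
    beta = np.zeros_like(beta0)
    for _ in range(n_sweeps):
        for j in range(len(beta)):
            # Partial residual correlation with coordinate j removed.
            c = target[j] - Sigma[j] @ beta + Sigma[j, j] * beta[j]
            beta[j] = soft_threshold(c, lam / 2.0) / Sigma[j, j]
    return beta
```

For an orthonormal dictionary ($\Sigma = I$) the update reduces to soft-thresholding each $\beta_j^0$ at level $\lambda/2$.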

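Let us make precise what we mean by an oracle inequality. With $\beta$ being a vector in $\mathbb{R}^p$, and $\mathcal{N} \subset \{1,\dots,p\}$ an index set, we denote by

$$\beta_{j,\mathcal{N}} := \beta_j 1\{j \in \mathcal{N}\}, \quad j = 1,\dots,p,$$

the vector with non-zero entries in the set $\mathcal{N}$ (hence, for example $\beta^0_S = \beta^0$).

Definition: Sparsity constant and sparsity oracle inequality. The sparsity constant $\phi_0$ is the largest value $\phi_0 > 0$ such that the Lasso with $\beta^*$ and $f^*$ satisfies the $\phi_0$-sparsity oracle inequality

$$\|f^* - f^0\|^2 + \lambda \|\beta^*_{S^c}\|_1 \le \frac{\lambda^2 s}{\phi_0^2}.$$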



Restricted eigenvalue conditions (see Koltchinskii (2009a,b) and Bickel et al. (2009)) have been developed to derive lower bounds for the sparsity constant. We will present these conditions in the next section. Irrepresentable conditions (see Zhao and Yu (2006)) are tailored for proving variable selection, i.e., showing that $S_* = S$, or, more modestly, that the symmetric difference $S_* \triangle S$ is small.



1.1 Organization of the paper 

We start, in Section 2, with an overview of the conditions we will compare, and some pointers to the literature. Once the conditions are made explicit, we give in Subsection 2.2 a summary of the various relations. Figure 1 displayed there makes it possible to see these at a single glance. We give a proof of each of the indicated (numbered) implications. Sections 3 - 9 rigorously deal with all the different cases. The weakest condition is a compatibility condition. Stronger conditions can rule out many interesting cases. We illustrate in Section 10 that one may check compatibility using approximations. We give several examples where the compatibility condition holds. We also give an example where the compatibility condition yields a major improvement to the oracle result, as compared to the restricted eigenvalue condition. The noisy case, studied briefly in Section 11, poses no additional theoretical difficulties. A lower bound on the regularization parameter $\lambda$ is required, and implications become somewhat more technical because all further results depend on this lower bound. Section 12 discusses the results.

1.2 Some notation 

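For a vector $v$, we invoke the usual notation

$$\|v\|_q = \begin{cases} \left(\sum_j |v_j|^q\right)^{1/q}, & 1 \le q < \infty, \\ \max_j |v_j|, & q = \infty. \end{cases}$$

The Gram matrix is

$$\Sigma := \int \psi^T \psi \, dQ, \quad \psi := (\psi_1, \dots, \psi_p),$$

so that

$$\|f_\beta\|^2 = \beta^T \Sigma \beta.$$

The entries of $\Sigma$ are denoted by $\sigma_{j,k} := (\psi_j, \psi_k)$, with $(\cdot,\cdot)$ being the inner product in $L_2(Q)$.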

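To clarify the notions we shall use, consider for a moment a partition of the form

$$\Sigma := \begin{pmatrix} \Sigma_{1,1} & \Sigma_{1,2} \\ \Sigma_{2,1} & \Sigma_{2,2} \end{pmatrix},$$

where $\Sigma_{1,1}$ is an $N \times N$ matrix, $\Sigma_{2,1}$ is a $(p-N) \times N$ matrix and $\Sigma_{1,2} := \Sigma_{2,1}^T$ is its transpose, and where $\Sigma_{2,2}$ is a $(p-N) \times (p-N)$ matrix. Such partitions will play an important role in the sections to come.

More generally, for a set $\mathcal{N} \subset \{1,\dots,p\}$ with size $N$, we introduce the $N \times N$ matrix

$$\Sigma_{1,1}(\mathcal{N}) := (\sigma_{j,k})_{j,k \in \mathcal{N}},$$

the $(p-N) \times N$ matrix

$$\Sigma_{2,1}(\mathcal{N}) := (\sigma_{j,k})_{j \notin \mathcal{N},\, k \in \mathcal{N}},$$

and the $(p-N) \times (p-N)$ matrix

$$\Sigma_{2,2}(\mathcal{N}) := (\sigma_{j,k})_{j,k \notin \mathcal{N}}.$$

We let $\Lambda^2_{\min}(\Sigma_{1,1}(\mathcal{N}))$ be the smallest eigenvalue of $\Sigma_{1,1}(\mathcal{N})$. Throughout, we assume that, for the fixed active set $S$, the smallest eigenvalue $\Lambda^2_{\min}(\Sigma_{1,1}(S))$ is strictly positive, i.e., that $\Sigma_{1,1}(S)$ is non-singular.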
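The block notation can be made concrete with a small NumPy sketch. It is only an illustration: the helper names are hypothetical, and indices are 0-based here whereas the text uses $\{1,\dots,p\}$.

```python
import numpy as np

def gram_blocks(Sigma, N):
    """Return Sigma_{1,1}(N), Sigma_{2,1}(N), Sigma_{2,2}(N) for an index set N
    (0-based indices; an assumption of this sketch)."""
    Sigma = np.asarray(Sigma, dtype=float)
    members = set(N)
    inside = sorted(members)
    outside = [j for j in range(Sigma.shape[0]) if j not in members]
    S11 = Sigma[np.ix_(inside, inside)]
    S21 = Sigma[np.ix_(outside, inside)]
    S22 = Sigma[np.ix_(outside, outside)]
    return S11, S21, S22

def smallest_eigenvalue(S11):
    """Lambda_min^2 of a symmetric block (eigvalsh returns ascending order)."""
    return float(np.linalg.eigvalsh(S11)[0])
```

The standing assumption that $\Sigma_{1,1}(S)$ is non-singular corresponds to `smallest_eigenvalue(gram_blocks(Sigma, S)[0]) > 0`.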

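We sometimes identify $\beta_{\mathcal{N}}$ with the $|\mathcal{N}|$-dimensional vector $\{\beta_j\}_{j \in \mathcal{N}}$, and write e.g.,

$$\beta_{\mathcal{N}}^T \Sigma \beta_{\mathcal{N}} = \beta_{\mathcal{N}}^T \Sigma_{1,1}(\mathcal{N}) \beta_{\mathcal{N}}.$$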

2 An overview of definitions 



The definitions we will present are conditions on the Gram matrix $\Sigma$, namely conditions on quadratic forms $\beta^T \Sigma \beta$, where $\beta$ is restricted to lie in some subset of $\mathbb{R}^p$. We first take the set of restrictions

$$\mathcal{R}(L,S) := \{\beta : \|\beta_{S^c}\|_1 \le L \|\beta_S\|_1 \ne 0\}.$$

The compatibility condition we discuss here is from van de Geer (2007). Its name is based on the idea that we require the $\ell_1$-norm and the $L_2(Q)$-norm to be somehow compatible.

Definition: Compatibility condition. We call

$$\phi^2_{\rm compatible}(L,S) := \min\left\{ \frac{s \|f_\beta\|^2}{\|\beta_S\|_1^2} : \beta \in \mathcal{R}(L,S) \right\}$$

the $(L,S)$-restricted $\ell_1$-eigenvalue.

The $(L,S)$-compatibility condition is satisfied if $\phi_{\rm compatible}(L,S) > 0$.

The bound $\|\beta_S\|_1 \le \sqrt{s}\|\beta_S\|_2$ (which holds for any $\beta$) leads to two successively stronger versions of restricted eigenvalues. We moreover consider supersets $\mathcal{N}$ of $S$ with size at most $N$. Throughout our definitions, $N \ge s$. We will only invoke $N = s$ and $N = 2s$ (for simplicity).

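Define the sets of restrictions

$$\mathcal{R}_{\rm adaptive}(L,S) := \{\beta : \|\beta_{S^c}\|_1 \le \sqrt{s} L \|\beta_S\|_2\},$$

and for $\mathcal{N} \supset S$,

$$\mathcal{R}(L,S,\mathcal{N}) := \{\beta \in \mathcal{R}(L,S) : \|\beta_{\mathcal{N}^c}\|_\infty \le \min_{j \in \mathcal{N} \setminus S} |\beta_j|\},$$

and

$$\mathcal{R}_{\rm adaptive}(L,S,\mathcal{N}) := \{\beta \in \mathcal{R}_{\rm adaptive}(L,S) : \|\beta_{\mathcal{N}^c}\|_\infty \le \min_{j \in \mathcal{N} \setminus S} |\beta_j|\}.$$

If $N = s$, we necessarily have $\mathcal{N} \setminus S = \emptyset$. In that case, we let $\min_{j \in \mathcal{N} \setminus S} |\beta_j| = \infty$, so that the additional constraint is void, i.e., $\mathcal{R}(L,S,S) = \mathcal{R}(L,S)$ ($\mathcal{R}_{\rm adaptive}(L,S,S) = \mathcal{R}_{\rm adaptive}(L,S)$).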



The restricted eigenvalue condition is from Bickel et al. (2009) and Koltchinskii (2009b). We complement it with the adaptive restricted eigenvalue condition. The name of the latter is inspired by the fact that this strengthened version is useful for the development of theory for the adaptive Lasso (Zou, 2006), which we do not show in this paper.

Definition: (Adaptive) restricted eigenvalue. We call

$$\phi^2(L,S,N) := \min\left\{ \frac{\|f_\beta\|^2}{\|\beta_{\mathcal{N}}\|_2^2} : \mathcal{N} \supset S,\ |\mathcal{N}| \le N,\ \beta \in \mathcal{R}(L,S,\mathcal{N}) \right\}$$

the $(L,S,N)$-restricted eigenvalue, and, similarly,

$$\phi^2_{\rm adaptive}(L,S,N) := \min\left\{ \frac{\|f_\beta\|^2}{\|\beta_{\mathcal{N}}\|_2^2} : \mathcal{N} \supset S,\ |\mathcal{N}| \le N,\ \beta \in \mathcal{R}_{\rm adaptive}(L,S,\mathcal{N}) \right\}$$

the adaptive $(L,S,N)$-restricted eigenvalue. The (adaptive) $(L,S,N)$-restricted eigenvalue condition holds if $\phi(L,S,N) > 0$ ($\phi_{\rm adaptive}(L,S,N) > 0$).

We introduce the (adaptive) restricted regression condition to clarify various connections 
between different assumptions. 

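Definition: (Adaptive) restricted regression. The $(L,S,N)$-restricted regression is

$$\vartheta(L,S,N) := \max\left\{ \frac{|(f_{\beta_{\mathcal{N}}}, f_{\beta_{\mathcal{N}^c}})|}{\|f_{\beta_{\mathcal{N}}}\|^2} : \mathcal{N} \supset S,\ |\mathcal{N}| \le N,\ \beta \in \mathcal{R}(L,S,\mathcal{N}) \right\}.$$

The adaptive $(L,S,N)$-restricted regression is

$$\vartheta_{\rm adaptive}(L,S,N) := \max\left\{ \frac{|(f_{\beta_{\mathcal{N}}}, f_{\beta_{\mathcal{N}^c}})|}{\|f_{\beta_{\mathcal{N}}}\|^2} : \mathcal{N} \supset S,\ |\mathcal{N}| \le N,\ \beta \in \mathcal{R}_{\rm adaptive}(L,S,\mathcal{N}) \right\}.$$

The (adaptive) $(L,S,N)$-restricted regression condition holds if $\vartheta(L,S,N) < 1$ ($\vartheta_{\rm adaptive}(L,S,N) < 1$).

Note that $(f_{\beta_{\mathcal{N}}}, f_{\beta_{\mathcal{N}^c}}) / \|f_{\beta_{\mathcal{N}}}\|^2$ equals the coefficient when regressing $f_{\beta_{\mathcal{N}^c}}$ onto $f_{\beta_{\mathcal{N}}}$.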



Of course all these definitions depend on the Gram matrix $\Sigma$. In Sections 10 and 11 we make this dependence explicit by adding the argument $\Sigma$, e.g. the $(\Sigma,L,S)$-compatibility condition, etc.

When $L = 1$, the argument $L$ is omitted, e.g. $\phi_{\rm compatible}(S) := \phi_{\rm compatible}(1,S)$, and e.g., the $S$-compatibility condition is then the condition $\phi_{\rm compatible}(S) > 0$. The case $L > 1$ is mainly needed to handle the situation with noise, and $L < 1$ is of interest when studying the adaptive Lasso (but we do not develop its theory in this paper).



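We now present some definitions from Candes and Tao (2005).

Definition: Restricted orthogonality constant. The quantity

$$\theta(S,N) := \sup_{\mathcal{N} \supset S:\ |\mathcal{N}| \le N}\ \sup_{\mathcal{M} \subset \mathcal{N}^c,\ |\mathcal{M}| \le s}\ \sup_\beta \frac{|(f_{\beta_{\mathcal{N}}}, f_{\beta_{\mathcal{M}}})|}{\|\beta_{\mathcal{N}}\|_2 \|\beta_{\mathcal{M}}\|_2}$$

is called the $(S,N)$-restricted orthogonality constant. We moreover define

$$\theta_{s,N} := \max\{\theta(S,N) : |S| = s\}.$$

Definition: Restricted isometry constant. The $N$-restricted isometry constant is the smallest value of $\delta_N$ such that for all $\mathcal{N}$ with $|\mathcal{N}| \le N$,

$$(1 - \delta_N)\|\beta_{\mathcal{N}}\|_2^2 \le \|f_{\beta_{\mathcal{N}}}\|^2 \le (1 + \delta_N)\|\beta_{\mathcal{N}}\|_2^2.$$

Definition: Uniform eigenvalue. The $(S,N)$-uniform eigenvalue is

$$\Lambda^2(S,N) := \inf_{\mathcal{N} \supset S,\ |\mathcal{N}| \le N} \Lambda^2_{\min}(\Sigma_{1,1}(\mathcal{N})).$$

As mentioned before, we always assume that $\Lambda(S,s) > 0$.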
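For small $p$ the uniform eigenvalue can be evaluated by brute force. The following sketch enumerates all supersets of $S$ of size at most $N$; the function name is hypothetical, indices are 0-based, and $S$ is assumed non-empty. The cost is exponential in $N - s$, so this is meant for toy examples only.

```python
import itertools
import numpy as np

def uniform_eigenvalue(Sigma, S, N):
    """Lambda^2(S, N): smallest eigenvalue of Sigma_{1,1}(calN) over all
    index sets calN containing S with |calN| <= N (0-based indices)."""
    Sigma = np.asarray(Sigma, dtype=float)
    base = sorted(set(S))
    rest = [j for j in range(Sigma.shape[0]) if j not in set(base)]
    best = np.inf
    for extra in range(0, N - len(base) + 1):
        for add in itertools.combinations(rest, extra):
            idx = base + list(add)
            lam_min = np.linalg.eigvalsh(Sigma[np.ix_(idx, idx)])[0]
            best = min(best, lam_min)
    return float(best)
```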

Definition: Weak restricted isometry. The weak $(S,N)$-restricted isometry constant is

$$\vartheta_{\rm weak-RIP}(S,N) := \frac{\theta(S,N)}{\Lambda^2(S,N)}.$$

The weak $(L,S,N)$-restricted isometry property holds if $\vartheta_{\rm weak-RIP}(S,N) < 1/L$.

Definition: Restricted isometry property. The RIP constant is

$$\vartheta_{\rm RIP} := \frac{\theta_{s,2s}}{1 - \delta_s - \theta_{s,s}}.$$

The restricted isometry property, shortly RIP, holds if $\vartheta_{\rm RIP} < 1$.



An irrepresentable condition can be found in Zhao and Yu (2006). We use a modified version which involves only the design but not the true coefficient vector $\beta^0$ (whereas its sign vector appears in Zhao and Yu (2006)). The reason is that most other conditions considered in this paper do not depend on $\beta^0$ either. Our $(L,S,N)$-irrepresentable condition with $L = 1$ and $N = s$ is only slightly stronger than the condition in Zhao and Yu (2006).



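Definition: Irrepresentable condition.
Part 1. We call

$$\vartheta_{\rm irrepresentable}(S,N) := \min_{\mathcal{N} \supset S:\ |\mathcal{N}| \le N}\ \max_{\|\tau_{\mathcal{N}}\|_\infty \le 1} \|\Sigma_{2,1}(\mathcal{N}) \Sigma_{1,1}^{-1}(\mathcal{N}) \tau_{\mathcal{N}}\|_\infty$$

the $(S,N)$-uniform irrepresentable constant. The $(L,S,N)$-uniform irrepresentable condition is met, if $\vartheta_{\rm irrepresentable}(S,N) < 1/L$.

Part 2. We say that the $(L,S,N)$-irrepresentable condition is met, if for some $\mathcal{N} \supset S$ with $|\mathcal{N}| \le N$, and all vectors $\tau_{\mathcal{N}}$ satisfying $\tau_{\mathcal{N}} \in \{-1,1\}^{|\mathcal{N}|}$, we have

$$\|\Sigma_{2,1}(\mathcal{N}) \Sigma_{1,1}^{-1}(\mathcal{N}) \tau_{\mathcal{N}}\|_\infty < 1/L.$$

Part 3. We say that the weak $(S,N)$-irrepresentable condition is met, if for all $\tau_S \in \{-1,1\}^s$, and for some $\mathcal{N} \supset S$ with $|\mathcal{N}| \le N$, and for some $\tau_{\mathcal{N} \setminus S} \in \{-1,1\}^{|\mathcal{N} \setminus S|}$, we have

$$\|\Sigma_{2,1}(\mathcal{N}) \Sigma_{1,1}^{-1}(\mathcal{N}) \tau_{\mathcal{N}}\|_\infty \le 1.$$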
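In the special case $N = s$ the only admissible set is $\mathcal{N} = S$, and the maximum of $\|A\tau\|_\infty$ over $\|\tau\|_\infty \le 1$ equals the largest absolute row sum of $A$. Under these observations the $(S,s)$-uniform irrepresentable constant can be computed directly, as in the sketch below (hypothetical function name, 0-based indices, $S$ a proper non-empty subset with $\Sigma_{1,1}(S)$ invertible).

```python
import numpy as np

def uniform_irrepresentable_constant(Sigma, S):
    """(S, s)-uniform irrepresentable constant: the maximum absolute row sum
    of Sigma_{2,1}(S) Sigma_{1,1}(S)^{-1}."""
    Sigma = np.asarray(Sigma, dtype=float)
    inside = sorted(set(S))
    outside = [j for j in range(Sigma.shape[0]) if j not in set(inside)]
    S11 = Sigma[np.ix_(inside, inside)]
    S21 = Sigma[np.ix_(outside, inside)]
    # Sigma_{1,1} is symmetric, so (S11^{-1} S21')' = S21 S11^{-1}.
    A = np.linalg.solve(S11, S21.T).T
    return float(np.abs(A).sum(axis=1).max())
```

A returned value below $1/L$ corresponds to the $(L,S,s)$-uniform irrepresentable condition of Part 1.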



Finally, we present coherence conditions, which are in the spirit of Bunea et al. (2007b,c). Cai et al. (2009b) derive an oracle result under a tight coherence condition.



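Definition: Coherence. The $(L,S)$-mutual coherence condition holds if

$$\vartheta_{\rm mutual}(S) := \frac{\sqrt{s} \max_{j \notin S} \sqrt{\sum_{k \in S} \sigma_{j,k}^2}}{\Lambda^2(S,s)} < 1/L.$$

The $(L,S)$-cumulative coherence condition holds if

$$\vartheta_{\rm cumulative}(S) := \frac{\sqrt{s} \sqrt{\sum_{k \in S} \left(\sum_{j \notin S} |\sigma_{j,k}|\right)^2}}{\Lambda^2(S,s)} < 1/L.$$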
2.1 Implications for the Lasso and some first relations



It is shown in van de Geer (2007) that the compatibility condition implies oracle inequalities for the Lasso. We re-derive the result for later reference and also for illustrating that the compatibility condition is just a condition to make the proof go through. We also show (again for later reference) the additional $\ell_2$-result if one uses the $(S,N)$-restricted eigenvalue condition.

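Lemma 2.1 (Oracle inequality) We have for the Lasso in (1),

$$\|f^* - f^0\|^2 + \lambda \|\beta^*_{S^c}\|_1 \le \lambda^2 s / \phi^2_{\rm compatible}(S).$$

Moreover, letting $\mathcal{N}_* \setminus S$ be the set of the $N - s$ largest coefficients $|\beta_j^*|$, $j \in S^c$,

$$\|\beta^*_{\mathcal{N}_*} - \beta^0_{\mathcal{N}_*}\|_2 \le \lambda \sqrt{s} / \phi^2(S,N).$$

Proof of Lemma 2.1. The first assertion follows from the Basic Inequality

$$\|f^* - f^0\|^2 + \lambda \|\beta^*\|_1 \le \lambda \|\beta^0\|_1,$$

using the definition of the Lasso in (1), which implies

$$\|f^* - f^0\|^2 + \lambda \|\beta^*_{S^c}\|_1 \le \lambda \|\beta^0_S\|_1 - \lambda \|\beta^*_S\|_1 \le \lambda \|\beta^*_S - \beta^0_S\|_1 \le \lambda \sqrt{s} \|f^* - f^0\| / \phi_{\rm compatible}(S).$$

Note that the last inequality holds because $\beta^* - \beta^0 \in \mathcal{R}(S)$, which follows from its preceding inequality:

$$\|\beta^*_{S^c}\|_1 = \|\beta^*_{S^c} - \beta^0_{S^c}\|_1 \le \|\beta^*_S - \beta^0_S\|_1.$$

The second result follows from

$$\|\beta^*_{\mathcal{N}_*} - \beta^0_{\mathcal{N}_*}\|_2^2 \le \|f^* - f^0\|^2 / \phi^2(S,N),$$

and using $\phi_{\rm compatible}(S) \ge \phi(S,N)$.

□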



An implication of Lemma 2.1 is an $\ell_1$-norm result:

$$\|\beta^*_{S^c}\|_1 + \|\beta^*_S - \beta^0_S\|_1 \le \lambda s / \phi^2_{\rm compatible}(S) + \sqrt{s} \|f^* - f^0\| / \phi_{\rm compatible}(S) \le 2 \lambda s / \phi^2_{\rm compatible}(S),$$

where the last inequality uses the first assertion in Lemma 2.1. We also note that the second assertion in Lemma 2.1 has most statistical importance for the case with $N = s$. We will need the case $N = 2s$ later in our proofs.



Meinshausen and Bühlmann (2006) and Zhao and Yu (2006) prove that the irrepresentable condition is sufficient and essentially necessary for variable selection, i.e., for achieving $S_* = S$. We will also present a self-contained proof in Section 6, where we will show that the $(S,s)$-irrepresentable condition is sufficient and the weak $(S,s)$-irrepresentable condition is essentially necessary for variable selection.



Bickel et al. (2009) prove oracle inequalities under the restricted eigenvalue condition. They assume

$$\min\{\phi(L,S,s) : |S| = s\} > 0$$

(where $L$ can be taken equal to one in the noiseless case).

The restricted isometry property from Candes and Tao (2005), abbreviated to RIP, also requires uniformity in $S$. They assume the RIP

$$\vartheta_{\rm RIP} < 1.$$

They show that the RIP implies exact reconstruction of $\beta^0$ from $f^0$ by linear programming (that is, by minimizing $\|\beta\|_1$ subject to $\|f_\beta - f^0\| = 0$). Cai et al. (2009a) prove this result assuming $\delta_N + \theta_{s,N} < 1$ for $N = 1.25s$ only; see also Cai et al. (2009) for an earlier result. It is clear that $1 - \delta_N \le \Lambda^2(S,N)$, i.e., the restricted isometry constants are more demanding than uniform eigenvalues. Candes and Tao (2005) furthermore show that

$$\vartheta_{\rm weak-RIP}(S,2s) \le \vartheta_{\rm RIP}.$$

See also Figure 1. They prove that the RIP is sufficient for establishing oracle inequalities for the Dantzig selector. Koltchinskii (2009a) and Bickel et al. (2009) show that

$$\phi(L,S,2s) \ge (1 - L \vartheta_{\rm weak-RIP}(S,2s)) \Lambda(S,2s).$$

Thus, the weak $(S,2s)$-restricted isometry property implies the $(S,2s)$-restricted eigenvalue condition. See also Figure 1.

Bunea et al. (2007a,b,c) show that their coherence conditions imply oracle results and refinements (see also Section 4 for their condition on the diagonal of $\Sigma$). Candes and Plan (2009) weaken the coherence conditions by restricting the parameter space for the regression coefficient $\beta$.

Finally, it is clear that $\phi_{\rm adaptive}(L,S,N) \le \phi(L,S,N) \le \phi_{\rm compatible}(L,S)$, i.e.,

adaptive restricted eigenvalue condition =>
restricted eigenvalue condition =>
compatibility condition.

See also Figure 1.

It is easy to see that $\vartheta(L,S,N)$ and $\vartheta_{\rm adaptive}(L,S,N)$ scale with $L$, i.e., we have

$$\vartheta(L,S,N) = L \vartheta(S,N), \quad \vartheta_{\rm adaptive}(L,S,N) = L \vartheta_{\rm adaptive}(S,N).$$

This is not true for the (adaptive) restricted ($\ell_1$-)eigenvalues. It indicates that the (adaptive) restricted regression is not well-calibrated for proving compatibility or restricted eigenvalue conditions, i.e., one might pay a large price for taking the route to oracle results via restricted regression conditions.



We end this subsection with the following lemma, which is based on ideas in Candes and Tao (2007). A corollary is the $\ell_2$-bound given in (2), which thus illustrates that considering supersets $\mathcal{N}$ of $S$ can be useful. However, we use the lemma for other purposes as well.

We let for any $\beta$, $r_j(\beta) := {\rm rank}(|\beta_j|)$, $j \in S^c$, if we put the coefficients in decreasing order. Let $\mathcal{N}_0(\beta)$ be the set of the $s$ largest coefficients in $S^c$:

$$\mathcal{N}_0(\beta) := \{j : r_j(\beta) \in \{1,\dots,s\}\}.$$

Put $\mathcal{N}(\beta) := \mathcal{N}_0(\beta) \cup S$. Further, assuming without loss of generality that $p = (K+2)s$ for some integer $K \ge 0$, we let for $k = 1,\dots,K$,

$$\mathcal{N}_k(\beta) := \{j : r_j(\beta) \in \{ks+1,\dots,(k+1)s\}\}.$$

We further define

$$\mathcal{N}_* := \mathcal{N}(\beta^*), \quad \mathcal{N}_{k,*} := \mathcal{N}_k(\beta^*), \quad k = 0,1,\dots,K.$$

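Lemma 2.2 We have for any $r \ge 1$, and $1/r + 1/q = 1$, and any $\beta$, and for $\mathcal{N} := \mathcal{N}(\beta)$, and $\mathcal{N}_k := \mathcal{N}_k(\beta)$, $k = 0,1,\dots,K$, the bound

$$\|\beta_{\mathcal{N}^c}\|_r \le \sum_{k=1}^K \|\beta_{\mathcal{N}_k}\|_r \le \|\beta_{S^c}\|_1 / s^{1/q}.$$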
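The block construction behind Lemma 2.2 can be checked numerically. The sketch below sorts the coefficients outside $S$ in decreasing order, cuts them into consecutive blocks of size $s$, and returns the three quantities of the lemma for a finite $r \ge 1$. The function name is hypothetical, indices are 0-based, and the last block is allowed to be shorter than $s$ (the text assumes $p = (K+2)s$ to avoid this).

```python
import numpy as np

def block_norm_bound(beta, S, r):
    """Return (||beta_{N^c}||_r, sum_k ||beta_{N_k}||_r, ||beta_{S^c}||_1 / s^{1/q})
    with 1/r + 1/q = 1, for finite r >= 1 (0-based indices)."""
    beta = np.asarray(beta, dtype=float)
    members = set(S)
    s = len(members)
    Sc = [j for j in range(len(beta)) if j not in members]
    mags = np.sort(np.abs(beta[Sc]))[::-1]  # decreasing order
    blocks = [mags[k:k + s] for k in range(0, len(mags), s)]  # N_0, N_1, ...
    tail = np.concatenate(blocks[1:]) if len(blocks) > 1 else np.zeros(0)
    lhs = np.linalg.norm(tail, ord=r)
    mid = sum(np.linalg.norm(b, ord=r) for b in blocks[1:])
    rhs = np.abs(beta[Sc]).sum() / s ** (1.0 - 1.0 / r)
    return lhs, mid, rhs
```

For any input the returned triple should be non-decreasing, which is the content of the lemma.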



Corollary 2.1 Combining Lemma 2.1 with Lemma 2.2 gives

$$\|\beta^* - \beta^0\|_2^2 \le 2 \lambda^2 s / \phi^4(S,2s). \qquad (2)$$

This result is from Bickel et al. (2009). The proof we give is essentially the same as theirs.



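Proof of Lemma 2.2. Clearly,

$$\|\beta_{\mathcal{N}^c}\|_r = \Big\|\sum_{k=1}^K \beta_{\mathcal{N}_k}\Big\|_r \le \sum_{k=1}^K \|\beta_{\mathcal{N}_k}\|_r.$$

We know that for $k = 1,\dots,K$,

$$|\beta_j| \le \|\beta_{\mathcal{N}_{k-1}}\|_1 / s, \quad j \in \mathcal{N}_k,$$

and hence,

$$\|\beta_{\mathcal{N}_k}\|_r^r \le s^{-(r-1)} \|\beta_{\mathcal{N}_{k-1}}\|_1^r.$$

It follows that

$$\sum_{k=1}^K \|\beta_{\mathcal{N}_k}\|_r \le s^{-1/q} \sum_{k=1}^K \|\beta_{\mathcal{N}_{k-1}}\|_1 \le \|\beta_{S^c}\|_1 / s^{1/q}.$$

□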
2.2 Summary of the results

The following figure summarizes the results.

[Figure 1: diagram of implications. Nodes recoverable from the extraction: RIP; weak (S,2s)-RIP; adaptive (S,2s)-restricted regression; (S,2s)-restricted eigenvalue; coherence; adaptive (S,s)-restricted regression; (S,s)-restricted eigenvalue; S-compatibility; (S,s)-uniform irrepresentable; (S,2s)-irrepresentable; weak (S,2s)-irrepresentable; with the conclusions "oracle inequalities for prediction and estimation", $|S_* \setminus S| \le s$, $|S_* \setminus S| = 0$ and $S_* = S$.]

Figure 1: A double arrow (=>) indicates a straight implication, whereas the more fancy arrowheads mean that the relation is under side-conditions. The numbers indicate the section where the result is (re)proved.

Our conclusion is that (perhaps not surprisingly) the compatibility condition is the least restrictive, and that many sufficient conditions for compatibility may be somewhat too harsh (see also our discussion in Section 12).



3 The restricted regression condition implies the restricted 
eigenvalue condition 

We start out with an elementary lemma. 

Lemma 3.1 Let $f_1$ and $f_2$ be two functions in $L_2(P)$. Suppose for some $0 < \vartheta < 1$,

$$-(f_1, f_2) \le \vartheta \|f_1\|^2.$$

Then

$$(1 - \vartheta)\|f_1\| \le \|f_1 + f_2\|.$$

Proof. Write the projection of $f_2$ on $f_1$ as

$$f_{2,1}^{\rm P} := (f_2, f_1) / \|f_1\|^2 \, f_1.$$

Similarly, let

$$(f_1 + f_2)_1^{\rm P} := (f_1 + f_2, f_1) / \|f_1\|^2 \, f_1$$

be the projection of $f_1 + f_2$ on $f_1$. Then

$$(f_1 + f_2)_1^{\rm P} = f_1 + f_{2,1}^{\rm P} = \left(1 + (f_2, f_1)/\|f_1\|^2\right) f_1,$$

so that

$$\|(f_1 + f_2)_1^{\rm P}\| = \left|1 + (f_2, f_1)/\|f_1\|^2\right| \|f_1\| = \left(1 + (f_2, f_1)/\|f_1\|^2\right) \|f_1\| \ge (1 - \vartheta)\|f_1\|.$$

Moreover, by Pythagoras' Theorem

$$\|f_1 + f_2\|^2 \ge \|(f_1 + f_2)_1^{\rm P}\|^2.$$

□
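The projection argument can be sanity-checked on finite-dimensional vectors, used here as a stand-in for functions in an $L_2$ space (an assumption of this sketch, as is the function name). The code computes the smallest non-negative $\vartheta$ admissible for a given pair and the two sides of the lemma.

```python
import numpy as np

def lemma31_sides(f1, f2):
    """Return (theta, (1 - theta) * ||f1||, ||f1 + f2||), where theta is the
    smallest non-negative value with -(f1, f2) <= theta * ||f1||^2."""
    f1 = np.asarray(f1, dtype=float)
    f2 = np.asarray(f2, dtype=float)
    theta = max(-(f1 @ f2) / (f1 @ f1), 0.0)
    lower = (1.0 - theta) * np.linalg.norm(f1)
    return theta, lower, np.linalg.norm(f1 + f2)
```

Whenever the returned `theta` is below one, the second returned value should not exceed the third.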

It is then straightforward to derive the following result.

Corollary 3.1 Suppose that $\vartheta(S,N) < 1/L$. Then

$$\phi^2(L,S,N) \ge \left(1 - L\vartheta(S,N)\right)^2 \Lambda^2(S,N).$$

A similar result is true for the adaptive versions. In other words, the (adaptive) restricted regression condition implies the (adaptive) restricted eigenvalue condition.



4 S-coherence conditions imply adaptive (S,s)-restricted regression conditions

Bunea et al. (2007a,b,c) establish oracle results under a condition which we refer to as the restricted diagonal condition. They provide coherence conditions for verifying the restricted diagonal condition.

Definition: Restricted diagonal condition. We say that the S-restricted diagonal 
condition holds if for some constant tp(S) > 

£ - (^(5)diag(t 5 ) 

is positive semi-definite. Here $\iota := (1,\dots,1)^T$ (so $\iota_{j,S} = 1\{j \in S\}$).

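We now show that coherence conditions actually imply restricted regression conditions. First, we consider some matrix norms in more detail. Let $1 \le q \le \infty$, and $r$ be its conjugate, i.e.,

$$\frac{1}{q} + \frac{1}{r} = 1.$$

Define

$$\|\Sigma_{1,2}(\mathcal{N})\|_{2,q} := \sup_{\|\beta_{\mathcal{N}^c}\|_r \le 1} \|\Sigma_{1,2}(\mathcal{N}) \beta_{\mathcal{N}^c}\|_2.$$

Some properties. The quantity $\|\Sigma_{1,2}(\mathcal{N})\|_{2,2}^2$ is the largest eigenvalue of the matrix $\Sigma_{1,2}(\mathcal{N}) \Sigma_{2,1}(\mathcal{N})$. We further have for $1 \le q < \infty$,

$$\|\Sigma_{1,2}(\mathcal{N})\|_{2,q} \le \sqrt{\sum_{k \in \mathcal{N}} \Big(\sum_{j \notin \mathcal{N}} |\sigma_{j,k}|^q\Big)^{2/q}},$$

and similarly for $q = \infty$,

$$\|\Sigma_{1,2}(\mathcal{N})\|_{2,\infty} \le \max_{j \notin \mathcal{N}} \sqrt{\sum_{k \in \mathcal{N}} \sigma_{j,k}^2}.$$

Moreover,

$$\|\Sigma_{1,2}(\mathcal{N})\|_{2,q} \ge \|\Sigma_{1,2}(\mathcal{N})\|_{2,\infty},$$

so for replacing $\|\Sigma_{1,2}(\mathcal{N})\|_{2,\infty}$ by $\|\Sigma_{1,2}(\mathcal{N})\|_{2,q}$, $q < \infty$, one might have to pay a price.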
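Lemma 4.1 For all $1 \le q \le \infty$, the following inequality holds:

$$\vartheta_{\rm adaptive}(S,2s) \le \max_{\mathcal{N} \supset S,\ |\mathcal{N}| = 2s} \frac{\sqrt{s}\, \|\Sigma_{1,2}(\mathcal{N})\|_{2,q}}{s^{1/q} \Lambda^2(S,2s)}.$$

Moreover,

$$\vartheta_{\rm adaptive}(S,s) \le \frac{\sqrt{s}\, \|\Sigma_{1,2}(S)\|_{2,\infty}}{\Lambda^2(S,s)}.$$

Proof of Lemma 4.1. Take $r$ such that $1/q + 1/r = 1$. Let $\mathcal{N} \supset S$, with $|\mathcal{N}| = 2s$, and let $\beta \in \mathcal{R}_{\rm adaptive}(S,\mathcal{N})$. We let $f_{\mathcal{N}} := f_{\beta_{\mathcal{N}}}$, $f_{\mathcal{N}^c} := f_{\beta_{\mathcal{N}^c}}$.
We have

$$|(f_{\mathcal{N}}, f_{\mathcal{N}^c})| = |\beta_{\mathcal{N}}^T \Sigma_{1,2}(\mathcal{N}) \beta_{\mathcal{N}^c}| \le \|\Sigma_{1,2}(\mathcal{N})\|_{2,q} \|\beta_{\mathcal{N}^c}\|_r \|\beta_{\mathcal{N}}\|_2.$$

Applying Lemma 2.2 gives

$$\|\beta_{\mathcal{N}^c}\|_r \le \|\beta_{S^c}\|_1 / s^{1/q} \le \sqrt{s} \|\beta_S\|_2 / s^{1/q} \le \sqrt{s} \|\beta_{\mathcal{N}}\|_2 / s^{1/q}. \qquad (3)$$

This yields

$$|(f_{\mathcal{N}}, f_{\mathcal{N}^c})| \le \sqrt{s} \|\Sigma_{1,2}(\mathcal{N})\|_{2,q} \|\beta_{\mathcal{N}}\|_2^2 / s^{1/q} \le \sqrt{s} \|\Sigma_{1,2}(\mathcal{N})\|_{2,q} \|f_{\mathcal{N}}\|^2 / (s^{1/q} \Lambda^2(S,2s)).$$

Similarly,

$$|(f_S, f_{S^c})| \le \|\Sigma_{1,2}(S)\|_{2,\infty} \|\beta_{S^c}\|_1 \|\beta_S\|_2 \le \sqrt{s} \|\Sigma_{1,2}(S)\|_{2,\infty} \|\beta_S\|_2^2 \le \sqrt{s} \|\Sigma_{1,2}(S)\|_{2,\infty} \|f_S\|^2 / \Lambda^2(S,s).$$

□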



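One of the consequences is in the spirit of the mutual coherence condition in Bunea et al. (2007b).

Corollary 4.1 (Coherence with q = ∞) We have

$$\vartheta_{\rm adaptive}(S,s) \le \frac{\sqrt{s} \max_{j \notin S} \sqrt{\sum_{k \in S} \sigma_{j,k}^2}}{\Lambda^2(S,s)} = \vartheta_{\rm mutual}(S).$$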
With $q = 1$ and $N = s$, the coherence lemma is similar to the cumulative local coherence condition in Bunea et al. (2007c). We also consider the case $N = 2s$.

Corollary 4.2 (Coherence with q = 1) We have

$$\vartheta_{\rm adaptive}(S,s) \le \vartheta_{\rm cumulative}(S),$$

and

$$\vartheta(S,2s) \le \max_{\mathcal{N} \supset S,\ |\mathcal{N}| = 2s} \frac{\|\Sigma_{1,2}(\mathcal{N})\|_{2,1}}{\sqrt{s}\, \Lambda^2(S,2s)}.$$

The coherence lemma with $q = 2$ is a condition about eigenvalues (recall that $\|\Sigma_{1,2}(\mathcal{N})\|_{2,2}^2$ equals the largest eigenvalue of $\Sigma_{1,2}(\mathcal{N}) \Sigma_{2,1}(\mathcal{N})$). The bound is then much rougher than the one following from the weak $(S,2s)$-restricted isometry condition, which we derive in Lemma 7.1.

Corollary 4.3 (Coherence with q = 2) We have

$$\vartheta_{\rm adaptive}(S,2s) \le \max_{\mathcal{N} \supset S,\ |\mathcal{N}| = 2s} \frac{\|\Sigma_{1,2}(\mathcal{N})\|_{2,2}}{\Lambda^2(S,2s)}.$$


5 The adaptive (S, s)-restricted regression condition implies 
the (S, s)-uniform irrepresentable condition 

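Theorem 5.1 We have

$$\vartheta_{\rm irrepresentable}(S,s) \le \vartheta_{\rm adaptive}(S,s).$$

Proof of Theorem 5.1. First observe that

$$\|\Sigma_{2,1}(S) \Sigma_{1,1}^{-1}(S) \tau_S\|_\infty = \sup_{\|\beta_{S^c}\|_1 \le 1} |\beta_{S^c}^T \Sigma_{2,1}(S) \Sigma_{1,1}^{-1}(S) \tau_S| = \sup_{\|\beta_{S^c}\|_1 \le 1} |(f_{\beta_{S^c}}, f_{b_S})|,$$

where

$$b_S := \Sigma_{1,1}^{-1}(S) \tau_S.$$

We note that

$$\frac{\|f_{b_S}\|^2}{\sqrt{s} \|b_S\|_2} = \frac{\|f_{b_S}\|^2}{\|\Sigma_{1,1}(S) b_S\|_2 \|b_S\|_2} \cdot \frac{\|\Sigma_{1,1}(S) b_S\|_2}{\sqrt{s}} \le 1.$$

(Use the Cauchy-Schwarz inequality for bounding the first factor; the second factor equals $\|\tau_S\|_2 / \sqrt{s} \le 1$.) Furthermore, for any constant $c > 0$,

$$\sup_{\|\beta_{S^c}\|_1 \le 1} |(f_{\beta_{S^c}}, f_{b_S})| = \sup_{\|\beta_{S^c}\|_1 \le c} |(f_{\beta_{S^c}}, f_{b_S})| / c.$$

Take $c = \sqrt{s} \|b_S\|_2$ to find

$$\|\Sigma_{2,1}(S) \Sigma_{1,1}^{-1}(S) \tau_S\|_\infty = \sup_{\|\beta_{S^c}\|_1 \le \sqrt{s} \|b_S\|_2} \frac{|(f_{\beta_{S^c}}, f_{b_S})|}{\sqrt{s} \|b_S\|_2} \le \sup_{\|\beta_{S^c}\|_1 \le \sqrt{s} \|b_S\|_2} \frac{|(f_{\beta_{S^c}}, f_{b_S})|}{\|f_{b_S}\|^2} \le \vartheta_{\rm adaptive}(S,s).$$

□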



6 The (S,s)-irrepresentable condition is sufficient and essentially necessary for variable selection

An important characterization of the solution $\beta^*$ can be derived from the Karush-Kuhn-Tucker (KKT) conditions, which in our context involve subdifferential calculus: see Bertsimas and Tsitsiklis (1997).

The KKT conditions. We have

$$2\Sigma(\beta^* - \beta^0) = -\lambda \tau^*.$$

Here $\|\tau^*\|_\infty \le 1$, and moreover

$$\tau_j^* 1\{\beta_j^* \ne 0\} = {\rm sign}(\beta_j^*), \quad j = 1,\dots,p.$$
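The KKT conditions give a direct numerical test of whether a candidate vector solves the noiseless Lasso: compute $\tau^* = -2\Sigma(\beta^* - \beta^0)/\lambda$ and check the two stated properties. The sketch below does this; the function names and the tolerance are assumptions of this illustration.

```python
import numpy as np

def kkt_tau(Sigma, beta_star, beta0, lam):
    """Return tau* = -2 Sigma (beta* - beta0) / lam."""
    Sigma = np.asarray(Sigma, dtype=float)
    diff = np.asarray(beta_star, dtype=float) - np.asarray(beta0, dtype=float)
    return -2.0 * (Sigma @ diff) / lam

def satisfies_kkt(Sigma, beta_star, beta0, lam, tol=1e-8):
    """Check ||tau*||_inf <= 1 and tau*_j = sign(beta*_j) on the active set."""
    tau = kkt_tau(Sigma, beta_star, beta0, lam)
    beta_star = np.asarray(beta_star, dtype=float)
    active = beta_star != 0
    bounded = np.all(np.abs(tau) <= 1.0 + tol)
    signs_ok = np.allclose(tau[active], np.sign(beta_star[active]), atol=tol)
    return bool(bounded and signs_ok)
```

For example, with $\Sigma$ having unit diagonal and off-diagonal entry 0.5, $\beta^0 = (1, 0)$ and $\lambda = 0.5$, the vector $(0.75, 0)$ passes the check with $\tau^* = (1, 0.5)$, whereas $\beta^0$ itself does not.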

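For $\mathcal{N} \supset S$, we write the projection of a function $f$ on the space spanned by $\{\psi_j\}_{j \in \mathcal{N}}$ as $f^{{\rm P}_{\mathcal{N}}}$, and the anti-projection as $f^{{\rm A}_{\mathcal{N}}} := f - f^{{\rm P}_{\mathcal{N}}}$. Hence, we note that

$$(f_{\beta_{\mathcal{N}}})^{{\rm P}_{\mathcal{N}}} = f_{\beta_{\mathcal{N}}},$$

and thus

$$(f_\beta)^{{\rm A}_{\mathcal{N}}} = (f_{\beta_{\mathcal{N}^c}})^{{\rm A}_{\mathcal{N}}}.$$

Moreover

$$\|(f_{\beta_{\mathcal{N}^c}})^{{\rm A}_{\mathcal{N}}}\|^2 = \beta_{\mathcal{N}^c}^T \Sigma_{2,2}(\mathcal{N}) \beta_{\mathcal{N}^c} - \beta_{\mathcal{N}^c}^T \Sigma_{2,1}(\mathcal{N}) \Sigma_{1,1}^{-1}(\mathcal{N}) \Sigma_{1,2}(\mathcal{N}) \beta_{\mathcal{N}^c}.$$

Lemma 6.1 Suppose $\Sigma_{1,1}^{-1}(\mathcal{N})$ exists. We have

$$2\|(f_{\beta^*})^{{\rm A}_{\mathcal{N}}}\|^2 = \lambda (\beta^*_{\mathcal{N}^c})^T \Sigma_{2,1}(\mathcal{N}) \Sigma_{1,1}^{-1}(\mathcal{N}) \tau^*_{\mathcal{N}} - \lambda \|\beta^*_{\mathcal{N}^c}\|_1.$$

Proof of Lemma 6.1. By the KKT conditions, we must have

$$2\Sigma_{1,1}(\mathcal{N})(\beta^*_{\mathcal{N}} - \beta^0_{\mathcal{N}}) + 2\Sigma_{1,2}(\mathcal{N}) \beta^*_{\mathcal{N}^c} = -\lambda \tau^*_{\mathcal{N}},$$
$$2\Sigma_{2,1}(\mathcal{N})(\beta^*_{\mathcal{N}} - \beta^0_{\mathcal{N}}) + 2\Sigma_{2,2}(\mathcal{N}) \beta^*_{\mathcal{N}^c} = -\lambda \tau^*_{\mathcal{N}^c}.$$

It follows that

$$2(\beta^*_{\mathcal{N}} - \beta^0_{\mathcal{N}}) + 2\Sigma_{1,1}^{-1}(\mathcal{N}) \Sigma_{1,2}(\mathcal{N}) \beta^*_{\mathcal{N}^c} = -\lambda \Sigma_{1,1}^{-1}(\mathcal{N}) \tau^*_{\mathcal{N}},$$
$$2\Sigma_{2,1}(\mathcal{N})(\beta^*_{\mathcal{N}} - \beta^0_{\mathcal{N}}) + 2\Sigma_{2,2}(\mathcal{N}) \beta^*_{\mathcal{N}^c} = -\lambda \tau^*_{\mathcal{N}^c}$$

(leaving the second equality untouched). Hence, multiplying the first equality by $-(\beta^*_{\mathcal{N}^c})^T \Sigma_{2,1}(\mathcal{N})$, and the second by $(\beta^*_{\mathcal{N}^c})^T$,

$$-2(\beta^*_{\mathcal{N}^c})^T \Sigma_{2,1}(\mathcal{N})(\beta^*_{\mathcal{N}} - \beta^0_{\mathcal{N}}) - 2(\beta^*_{\mathcal{N}^c})^T \Sigma_{2,1}(\mathcal{N}) \Sigma_{1,1}^{-1}(\mathcal{N}) \Sigma_{1,2}(\mathcal{N}) \beta^*_{\mathcal{N}^c} = \lambda (\beta^*_{\mathcal{N}^c})^T \Sigma_{2,1}(\mathcal{N}) \Sigma_{1,1}^{-1}(\mathcal{N}) \tau^*_{\mathcal{N}},$$
$$2(\beta^*_{\mathcal{N}^c})^T \Sigma_{2,1}(\mathcal{N})(\beta^*_{\mathcal{N}} - \beta^0_{\mathcal{N}}) + 2(\beta^*_{\mathcal{N}^c})^T \Sigma_{2,2}(\mathcal{N}) \beta^*_{\mathcal{N}^c} = -\lambda \|\beta^*_{\mathcal{N}^c}\|_1,$$

where we invoked that $\beta_j^* \tau_j^* = |\beta_j^*|$. Adding up the two equalities gives

$$2(\beta^*_{\mathcal{N}^c})^T \Sigma_{2,2}(\mathcal{N}) \beta^*_{\mathcal{N}^c} - 2(\beta^*_{\mathcal{N}^c})^T \Sigma_{2,1}(\mathcal{N}) \Sigma_{1,1}^{-1}(\mathcal{N}) \Sigma_{1,2}(\mathcal{N}) \beta^*_{\mathcal{N}^c} = \lambda (\beta^*_{\mathcal{N}^c})^T \Sigma_{2,1}(\mathcal{N}) \Sigma_{1,1}^{-1}(\mathcal{N}) \tau^*_{\mathcal{N}} - \lambda \|\beta^*_{\mathcal{N}^c}\|_1.$$

□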

We now connect the irrepresentable condition to variable selection. Define

$$|\beta^0|_{\min} := \min\{|\beta_j^0| : j \in S\}.$$

Lemma 6.2

Part 1. Suppose the $(S,N)$-uniform irrepresentable condition holds. Then $|S_* \setminus S| \le N - s$.

Part 2. Suppose the $(S,N)$-irrepresentable condition holds and

$$|\beta^0|_{\min} > \lambda s / \phi^2_{\rm compatible}(S).$$

Then $S_* \supset S$ and $|S_*| \le N$.

Part 3. Conversely, suppose that $S_* \supset S$ and $|S_*| \le N$, and $\Lambda(S,N) > 0$. Then

$$\|\Sigma_{2,1}(S_*) \Sigma_{1,1}^{-1}(S_*) \tau^*_{S_*}\|_\infty \le 1.$$

If moreover

$$|\beta^0|_{\min} > \lambda \sqrt{N} / (2\Lambda^2(S,N)),$$

then $\tau^*_S = \tau^0_S$, where $\tau^0_S := {\rm sign}(\beta^0_S)$.

A special case is $N = s$. In Part 1, we then obtain that $S_* \subset S$, i.e., no false positive selections. Moreover, Part 2 then proves $S_* = S$ and Part 3 assumes $S_* = S$.

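Proof of Lemma 6.2.

Part 1. Let $\mathcal{N} \supset S$ be a set of size at most $N$, such that

$$\sup_{\|\tau_{\mathcal{N}}\|_\infty \le 1} \|\Sigma_{2,1}(\mathcal{N}) \Sigma_{1,1}^{-1}(\mathcal{N}) \tau_{\mathcal{N}}\|_\infty < 1.$$

By Lemma 6.1, we now have that if $\|\beta^*_{\mathcal{N}^c}\|_1 > 0$,

$$2\|(f^*)^{{\rm A}_{\mathcal{N}}}\|^2 = \lambda (\beta^*_{\mathcal{N}^c})^T \Sigma_{2,1}(\mathcal{N}) \Sigma_{1,1}^{-1}(\mathcal{N}) \tau^*_{\mathcal{N}} - \lambda \|\beta^*_{\mathcal{N}^c}\|_1 < 0,$$

which is a contradiction. Hence $\|\beta^*_{\mathcal{N}^c}\|_1 = 0$, i.e., $S_* \subset \mathcal{N}$.

Part 2. By Lemma 2.1,

$$\|\beta^*_S - \beta^0_S\|_1 \le \sqrt{s} \|f^* - f^0\| / \phi_{\rm compatible}(S) \le \lambda s / \phi^2_{\rm compatible}(S).$$

The condition $|\beta^0|_{\min} > \lambda s / \phi^2_{\rm compatible}(S)$ thus implies that $S_* \supset S$, and hence that $\tau^*_S \in \{-1,1\}^s$. We also know that $\tau^*_{S_*} \in \{-1,1\}^{|S_*|}$. Hence for any $\mathcal{N}$ satisfying $S \subset \mathcal{N} \subset S_*$, also $\tau^*_{\mathcal{N}} \in \{-1,1\}^{|\mathcal{N}|}$. Thus, by the $(S,N)$-irrepresentable condition, there exists such an $\mathcal{N}$, say $\tilde{\mathcal{N}}$, with

$$\|\Sigma_{2,1}(\tilde{\mathcal{N}}) \Sigma_{1,1}^{-1}(\tilde{\mathcal{N}}) \tau^*_{\tilde{\mathcal{N}}}\|_\infty < 1.$$

As in Part 1, we then must have that $\|\beta^*_{\tilde{\mathcal{N}}^c}\|_1 = 0$.

Part 3. Because $\Lambda(S,N) > 0$, and $|S_*| \le N$, we know that $\Sigma_{1,1}^{-1}(S_*)$ exists. Because $S_* \supset S$, we have $\beta^*_{S_*^c} = \beta^0_{S_*^c} = 0$, so the KKT conditions take the form

$$2\Sigma_{1,1}(S_*)(\beta^*_{S_*} - \beta^0_{S_*}) = -\lambda \tau^*_{S_*},$$

and

$$2\Sigma_{2,1}(S_*)(\beta^*_{S_*} - \beta^0_{S_*}) = -\lambda \tau^*_{S_*^c}.$$

Hence

$$\beta^*_{S_*} - \beta^0_{S_*} = -\lambda \Sigma_{1,1}^{-1}(S_*) \tau^*_{S_*} / 2,$$

and, inserting this in the second KKT equality,

$$\Sigma_{2,1}(S_*) \Sigma_{1,1}^{-1}(S_*) \tau^*_{S_*} = \tau^*_{S_*^c}.$$

But then

$$\|\Sigma_{2,1}(S_*) \Sigma_{1,1}^{-1}(S_*) \tau^*_{S_*}\|_\infty = \|\tau^*_{S_*^c}\|_\infty \le 1.$$

The first KKT equality moreover implies

$$\|\beta^*_{S_*} - \beta^0_{S_*}\|_2 \le \lambda \sqrt{N} / (2\Lambda^2(S,N)).$$

So when $|\beta^0|_{\min} > \lambda \sqrt{N} / (2\Lambda^2(S,N))$, we have $\tau^*_S = \tau^0_S$.

□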



7 The weak (S,2s)-restricted isometry property implies the (S,2s)-restricted regression condition

Lemma 7.1 We have

$$\vartheta_{\rm adaptive}(S,2s) \le \vartheta_{\rm weak-RIP}(S,2s).$$

Proof of Lemma 7.1. Let $\beta$ be an arbitrary vector, satisfying $\|\beta_{S^c}\|_1 \le \sqrt{s} \|\beta_S\|_2$. From Lemma 2.2,

$$\sum_{k=1}^K \|\beta_{\mathcal{N}_k}\|_2 \le \|\beta_{S^c}\|_1 / \sqrt{s} \le \|\beta_S\|_2.$$

Hence, using the definition of the restricted orthogonality constant $\theta(S,2s)$, and of the $(S,2s)$-uniform eigenvalue $\Lambda^2(S,2s)$,

$$|(f_{\beta_{\mathcal{N}}}, f_{\beta_{\mathcal{N}^c}})| \le \theta(S,2s) \sum_{k=1}^K \|\beta_{\mathcal{N}}\|_2 \|\beta_{\mathcal{N}_k}\|_2 \le \theta(S,2s) \|\beta_{\mathcal{N}}\|_2 \|\beta_S\|_2 \le \theta(S,2s) \|f_{\beta_{\mathcal{N}}}\|^2 / \Lambda^2(S,2s),$$

or

$$|(f_{\beta_{\mathcal{N}}}, f_{\beta_{\mathcal{N}^c}})| / \|f_{\beta_{\mathcal{N}}}\|^2 \le \theta(S,2s) / \Lambda^2(S,2s) = \vartheta_{\rm weak-RIP}(S,2s).$$

□

Corollary 7.1 Together with Corollary 3.1 we can now conclude that when $\vartheta_{\rm weak-RIP}(S,2s) < 1/L$, one has

$$\phi^2(L,S,2s) \ge (1 - L\vartheta_{\rm weak-RIP}(S,2s))^2 \Lambda^2(S,2s).$$

This result is from Koltchinskii (2009a) and Bickel et al. (2009).



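8 The restricted isometry property with small constants implies the weak (S,2s)-irrepresentable condition

We start with two preparatory lemmas. Recall that

$$\vartheta_{\rm weak-RIP}(S,s) = \theta(S,s) / \Lambda^2(S,s).$$

Lemma 8.1 Suppose that

$$\vartheta_{\rm weak-RIP}(S,s) < 1.$$

Then

$$2\|(f_{\beta^*_{S^c}})^{{\rm A}_S}\|^2 \le \lambda \vartheta_{\rm weak-RIP}(S,s) \sqrt{s} \|\beta^*_{\mathcal{N}_{0,*}}\|_2,$$

where ${\rm A}_S$ denotes the anti-projection defined in Section 6.

Proof of Lemma 8.1. Define

$$b_S := \Sigma_{1,1}(S)^{-1} \tau^*_S.$$

Then

$$\|b_S\|_2 \le \|\tau^*_S\|_2 / \Lambda^2(S,s) \le \sqrt{s} / \Lambda^2(S,s).$$

Moreover,

$$|(\beta^*_{S^c})^T \Sigma_{2,1}(S) \Sigma_{1,1}^{-1}(S) \tau^*_S| = |(f_{\beta^*_{S^c}}, f_{b_S})| \le \sum_{k=0}^K |(f_{\beta^*_{\mathcal{N}_{k,*}}}, f_{b_S})|$$
$$\le \theta(S,s) \sum_{k=0}^K \|\beta^*_{\mathcal{N}_{k,*}}\|_2 \|b_S\|_2 = \theta(S,s) \|b_S\|_2 \Big( \|\beta^*_{\mathcal{N}_{0,*}}\|_2 + \sum_{k=1}^K \|\beta^*_{\mathcal{N}_{k,*}}\|_2 \Big)$$
$$\le \frac{\theta(S,s)}{\Lambda^2(S,s)} \sqrt{s} \|\beta^*_{\mathcal{N}_{0,*}}\|_2 + \frac{\theta(S,s)}{\Lambda^2(S,s)} \|\beta^*_{S^c}\|_1$$
$$= \vartheta_{\rm weak-RIP}(S,s) \sqrt{s} \|\beta^*_{\mathcal{N}_{0,*}}\|_2 + \vartheta_{\rm weak-RIP}(S,s) \|\beta^*_{S^c}\|_1.$$

Thus,

$$(\beta^*_{S^c})^T \Sigma_{2,1}(S) \Sigma_{1,1}^{-1}(S) \tau^*_S - \|\beta^*_{S^c}\|_1 \le \vartheta_{\rm weak-RIP}(S,s) \sqrt{s} \|\beta^*_{\mathcal{N}_{0,*}}\|_2 - (1 - \vartheta_{\rm weak-RIP}(S,s)) \|\beta^*_{S^c}\|_1 \le \vartheta_{\rm weak-RIP}(S,s) \sqrt{s} \|\beta^*_{\mathcal{N}_{0,*}}\|_2.$$

Hence, by Lemma 6.1,

$$2\|(f_{\beta^*_{S^c}})^{{\rm A}_S}\|^2 \le \lambda \vartheta_{\rm weak-RIP}(S,s) \sqrt{s} \|\beta^*_{\mathcal{N}_{0,*}}\|_2.$$

□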

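Lemma 8.2 Suppose that

$$\vartheta_{\rm weak-RIP}(S,s) < 1.$$

Then for any subset $\mathcal{N} \subset S^c$, with $|\mathcal{N}| \le s$, and any $b \in \mathbb{R}^p$,

$$|(f_{b_{\mathcal{N}}}, f^* - f^0)| \le \frac{\lambda \sqrt{s}}{\phi(S,2s) \Lambda(S,s)} \left( \theta(S,s) + \sqrt{(1 + \delta_s)\theta(S,s)/2} \right) \|b_{\mathcal{N}}\|_2.$$

Proof of Lemma 8.2. We have

$$|(f_{b_{\mathcal{N}}}, f^* - f^0)| \le |(f_{b_{\mathcal{N}}}, (f^* - f^0)^{{\rm P}_S})| + |(f_{b_{\mathcal{N}}}, (f^*)^{{\rm A}_S})|.$$

Let us write

$$(f^* - f^0)^{{\rm P}_S} := f_{\gamma_S}.$$

Then, invoking Lemma 2.1,

$$\|\gamma_S\|_2 \le \|f_{\gamma_S}\| / \Lambda(S,s) = \|(f^* - f^0)^{{\rm P}_S}\| / \Lambda(S,s) \le \|f^* - f^0\| / \Lambda(S,s) \le \lambda \sqrt{s} / (\phi(S,2s) \Lambda(S,s)).$$

It follows that

$$|(f_{b_{\mathcal{N}}}, (f^* - f^0)^{{\rm P}_S})| \le \theta(S,s) \|b_{\mathcal{N}}\|_2 \|\gamma_S\|_2 \le \theta(S,s) \|b_{\mathcal{N}}\|_2 \lambda \sqrt{s} / (\phi(S,2s) \Lambda(S,s)).$$

Moreover, we have by Lemma 2.1 (with $N = 2s$)

$$\|\beta^*_{\mathcal{N}_{0,*}}\|_2 \le \|\beta^*_{\mathcal{N}_*} - \beta^0_{\mathcal{N}_*}\|_2 \le \lambda \sqrt{s} / \phi^2(S,2s).$$

So, by Lemma 8.1,

$$\|(f^*)^{{\rm A}_S}\|^2 \le \lambda^2 s\, \theta(S,s) / (2 \phi^2(S,2s) \Lambda^2(S,s)).$$

Therefore

$$|(f_{b_{\mathcal{N}}}, (f^*)^{{\rm A}_S})| \le \|(f^*)^{{\rm A}_S}\| \|f_{b_{\mathcal{N}}}\| \le \lambda \sqrt{s} \sqrt{\theta(S,s)/2} / (\phi(S,2s) \Lambda(S,s)) \, \|f_{b_{\mathcal{N}}}\| \le \lambda \sqrt{s} \sqrt{(1 + \delta_s)\theta(S,s)/2} / (\phi(S,2s) \Lambda(S,s)) \, \|b_{\mathcal{N}}\|_2.$$

□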



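The next result shows that if the constants are small enough, then there will be no more than $s$ false positives. We define

$$\alpha(S) := \frac{\sqrt{2}\, \theta(S,s) + \sqrt{(1 + \delta_s)\theta(S,s)}}{\phi(S,2s) \Lambda(S,s)}.$$

Lemma 8.3 Suppose that

$$\alpha(S) < 1. \qquad (4)$$

Then $|S_* \setminus S| < s$.

Proof of Lemma 8.3. Since $\alpha(S) < 1$, Lemma 8.2 implies that for any $\mathcal{N} \subset S^c$, with $|\mathcal{N}| \le s$, and for any $b$ with $\|b_{\mathcal{N}}\|_2 \ne 0$,

$$|(f_{b_{\mathcal{N}}}, f^* - f^0)| < \lambda \sqrt{s/2}\, \|b_{\mathcal{N}}\|_2.$$

Hence, taking $b_j = (\psi_j, f^* - f^0)$, $j \in \mathcal{N}$,

$$\sum_{j \in \mathcal{N}} |(\psi_j, f^* - f^0)|^2 < \lambda^2 s / 2.$$

For $j \in S_*$, we have by the KKT conditions

$$|2(\psi_j, f^* - f^0)| \ge \lambda.$$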
Suppose now that $|S_* \setminus S| \ge s$. Then there is a subset $\mathcal{N}'$ of $S_* \setminus S$, with size $|\mathcal{N}'| = s$, and we have

X 2 s/2> £ l(^,r-/°)| 2 > A 2 |AHA 
jeAf' 

This is a contradiction, and hence $|S_* \setminus S| < s$. □
This leads to the following result. 

Theorem 8.1 Suppose that $\alpha(S) < 1$, see (4). Then the weak $(S,2s)$-irrepresentable condition holds.

Proof of Theorem 8.1. As $\alpha(S) < 1$, we know that $\phi(S,2s) > 0$. Take an arbitrary $\tau^0_S \in \{-1,1\}^s$, and a $\beta^0$ satisfying $\beta^0_S = \beta^0$, ${\rm sign}(\beta^0_S) = \tau^0_S$, and

$$|\beta^0|_{\min} > \lambda \sqrt{s} / \phi^2(S,2s).$$

By Lemma 2.1, the Lasso satisfies

$$\|\beta^*_S - \beta^0_S\|_2 \le \lambda \sqrt{s} / \phi^2(S,2s).$$

Hence, we must have $S_* \supset S$, and $\tau^*_S = \tau^0_S$. Moreover, by Lemma 8.3, $|S_*| < 2s$. By Part 3 of Lemma 6.2 we must have

$$\|\Sigma_{2,1}(S_*) \Sigma_{1,1}^{-1}(S_*) \tau^*_{S_*}\|_\infty \le 1.$$

Since $\tau^*_S = \tau^0_S$ is arbitrary and $\tau^*_{S_*} \in \{-1,1\}^{|S_*|}$, we conclude that the weak $(S,2s)$-irrepresentable condition holds (in fact the weak $(S,2s-1)$-irrepresentable condition holds).

□



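Corollary 8.1 The RIP is the condition $\vartheta_{\rm RIP} < 1$, or equivalently

$$\delta_s + \theta_{s,s} + \theta_{s,2s} < 1.$$

Candes and Tao (2005) show that $\delta_{2s} \le \theta_{s,s} + \delta_s$. The restricted isometry constant $\delta_s$ has to be less than one, so we may use the bound $1 + \delta_s < 2$. Moreover, it is clear that $\theta(S,N) \le \theta_{s,N}$, and $\Lambda^2(S,N) \ge 1 - \delta_N$. Inserting these bounds in Corollary 7.1 we find

$$\phi(S,2s) \Lambda(S,s) \ge (1 - \delta_s - \theta_{s,s} - \theta_{s,2s}) \sqrt{\frac{1 - \delta_s}{1 - \delta_s - \theta_{s,s}}} \ge 1 - \delta_s - \theta_{s,s} - \theta_{s,2s}.$$

It follows that

$$\alpha(S) \le \frac{\sqrt{2}\, \theta_{s,s} + \sqrt{2\theta_{s,s}}}{1 - \delta_s - \theta_{s,s} - \theta_{s,2s}}.$$

For example, if $\delta_s \le \sqrt{2} - 1$ and $\theta_{s,2s} \le 1/16$, we get (invoking $\theta_{s,s} \le \theta_{s,2s}$)

$$\alpha(S) \le 0.96.$$

We conclude that the RIP with small enough constants implies the weak $(S,2s)$-irrepresentable condition.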
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As Candes and Tao ( 2005| ) show, the RIP implies exact recovery. To complete the picture, 
we now show that the (S, s)-irrepresentable condition also implies exact recovery. 



The linear programming problem is

$$\min \{ \|\beta\|_1 : \|f_\beta - f^0\| = 0 \},$$

where, as before, $f^0 = f_{\beta^0}$ with $\beta^0 = \beta^0_S$. Let $\beta^{\rm LP}$ be the minimizer of the linear programming
problem.

Lemma 8.4 Suppose the (S, s)-irrepresentable condition holds. Then one has exact recovery,
i.e., $\beta^{\rm LP} = \beta^0$.



Proof of Lemma 8.4. This follows from Candès and Tao (2005). They show that
$\beta^{\rm LP} = \beta^0$ if one can find a $g \in L_2(Q)$, such that

(i) $(\psi_j, g) = \tau^0_j$ for all $j \in S$,

(ii) $|(\psi_j, g)| < 1$ for all $j \notin S$,

where, as before, $\tau^0_S := \mathrm{sign}(\beta^0_S)$. The (S, s)-irrepresentable condition says that this is true
for $g = f_{b_S}$, where $b_S = \Sigma_{1,1}^{-1}(S) \tau^0_S$. □



9 The (S, s)-uniform irrepresentable condition implies the
S-compatibility condition

As the (S, s)-irrepresentable condition implies variable selection, one expects it to be
more restrictive than the compatibility condition, which only implies a bound for the
prediction error (and $\ell_1$-estimation error). This turns out to be indeed the case, although we
prove it only under the uniform version of the irrepresentable condition.

Theorem 9.1 Suppose that

$$\vartheta_{\rm irrepresentable}(S, s) < 1/L.$$

Then

$$\phi^2_{\rm compatible}(L, S) \ge (1 - L \vartheta_{\rm irrepresentable}(S, s))^2 \Lambda^2(S, s).$$



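Proof of Theorem 9.1. Define

$$\beta^0 := \arg\min \{ \|f_\beta\|^2 : \|\beta_S\|_1 = 1, \ \|\beta_{S^c}\|_1 \le L \}.$$

Let us write $f^0 := f_{\beta^0}$, $f^0_S := f_{\beta^0_S}$ and $f^0_{S^c} := f_{\beta^0_{S^c}}$. Introduce a Lagrange multiplier $\lambda \in \mathbb{R}$
for the constraint $\|\beta_S\|_1 = 1$. By the KKT conditions, there exists a vector $\tau^0_S$, with
$\|\tau^0_S\|_\infty \le 1$, such that $(\tau^0_S)^T \beta^0_S = \|\beta^0_S\|_1$, and such that

$$\Sigma_{1,1}(S) \beta^0_S + \Sigma_{1,2}(S) \beta^0_{S^c} = -\lambda \tau^0_S. \qquad (5)$$

By multiplying by $(\beta^0_S)^T$, we obtain

$$\|f^0_S\|^2 + (f^0_S, f^0_{S^c}) = -\lambda \|\beta^0_S\|_1.$$

The restriction $\|\beta^0_S\|_1 = 1$ gives

$$\|f^0_S\|^2 + (f^0_S, f^0_{S^c}) = -\lambda.$$

We also have from (5)

$$\beta^0_S + \Sigma_{1,1}^{-1}(S) \Sigma_{1,2}(S) \beta^0_{S^c} = -\lambda \Sigma_{1,1}^{-1}(S) \tau^0_S. \qquad (6)$$

Hence, by multiplying with $(\tau^0_S)^T$,

$$\|\beta^0_S\|_1 + (\tau^0_S)^T \Sigma_{1,1}^{-1}(S) \Sigma_{1,2}(S) \beta^0_{S^c} = -\lambda (\tau^0_S)^T \Sigma_{1,1}^{-1}(S) \tau^0_S,$$

or

$$1 = -(\tau^0_S)^T \Sigma_{1,1}^{-1}(S) \Sigma_{1,2}(S) \beta^0_{S^c} - \lambda (\tau^0_S)^T \Sigma_{1,1}^{-1}(S) \tau^0_S$$
$$\le \vartheta \|\beta^0_{S^c}\|_1 - \lambda (\tau^0_S)^T \Sigma_{1,1}^{-1}(S) \tau^0_S$$
$$\le L \vartheta - \lambda (\tau^0_S)^T \Sigma_{1,1}^{-1}(S) \tau^0_S.$$

Here, we applied the (S, s)-uniform irrepresentable condition, with $\vartheta = \vartheta_{\rm irrepresentable}(S, s)$,
and the condition $\|\beta^0_{S^c}\|_1 \le L$. Thus

$$1 - L \vartheta \le -\lambda (\tau^0_S)^T \Sigma_{1,1}^{-1}(S) \tau^0_S.$$

Because $1 - L\vartheta > 0$ and $(\tau^0_S)^T \Sigma_{1,1}^{-1}(S) \tau^0_S > 0$, this implies that $\lambda < 0$, and in fact that

$$(1 - L \vartheta) \le -\lambda s / \Lambda^2(S, s),$$

where we invoked

$$(\tau^0_S)^T \Sigma_{1,1}^{-1}(S) \tau^0_S \le \|\tau^0_S\|_2^2 / \Lambda^2(S, s) \le s / \Lambda^2(S, s).$$

So

$$-\lambda \ge (1 - L \vartheta) \Lambda^2(S, s) / s.$$

Continuing with (6), we moreover have

$$(\beta^0_{S^c})^T \Sigma_{2,1}(S) \beta^0_S + (\beta^0_{S^c})^T \Sigma_{2,1}(S) \Sigma_{1,1}^{-1}(S) \Sigma_{1,2}(S) \beta^0_{S^c} = -\lambda (\beta^0_{S^c})^T \Sigma_{2,1}(S) \Sigma_{1,1}^{-1}(S) \tau^0_S.$$

In other words,

$$(f^0_S, f^0_{S^c}) + \|(f^0_{S^c})^{P_S}\|^2 = -\lambda (\beta^0_{S^c})^T \Sigma_{2,1}(S) \Sigma_{1,1}^{-1}(S) \tau^0_S,$$

where $(f^0_{S^c})^{P_S}$ is the projection of $f^0_{S^c}$ on the space spanned by $\{\psi_k\}_{k \in S}$. Again, by the
(S, s)-uniform irrepresentable condition and by $\|\beta^0_{S^c}\|_1 \le L$,

$$|(\beta^0_{S^c})^T \Sigma_{2,1}(S) \Sigma_{1,1}^{-1}(S) \tau^0_S| \le \vartheta \|\beta^0_{S^c}\|_1 \le L \vartheta,$$

so

$$-\lambda (\beta^0_{S^c})^T \Sigma_{2,1}(S) \Sigma_{1,1}^{-1}(S) \tau^0_S = |\lambda| (\beta^0_{S^c})^T \Sigma_{2,1}(S) \Sigma_{1,1}^{-1}(S) \tau^0_S \ge -|\lambda| L \vartheta = \lambda L \vartheta.$$

It follows that

$$\|f^0\|^2 = \|f^0_S\|^2 + 2 (f^0_S, f^0_{S^c}) + \|f^0_{S^c}\|^2$$
$$= -\lambda + (f^0_S, f^0_{S^c}) + \|f^0_{S^c}\|^2$$
$$\ge -\lambda + (f^0_S, f^0_{S^c}) + \|(f^0_{S^c})^{P_S}\|^2 \ge -\lambda + \lambda L \vartheta = -\lambda (1 - L \vartheta)$$
$$\ge (1 - L \vartheta)^2 \Lambda^2(S, s) / s.$$

Finally note that $\|f^0\|^2 = \phi^2_{\rm compatible}(L, S) / s$. □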

10 Verifying the compatibility and restricted eigenvalue condition

In this section, we discuss the theoretical verification of the conditions. Determining a
restricted $\ell_1$-eigenvalue is itself again a Lasso-type problem. Therefore, it is very
useful to look for some good lower bounds.

A first, rather trivial, observation is that if $\Sigma$ is non-singular, the restricted eigenvalue
condition holds for all $L$, $S$ and $N$, with $\phi^2(L, S, N) \ge \Lambda^2_{\min}(\Sigma)$, the latter being the
smallest eigenvalue of $\Sigma$. If $\Sigma$ is the population covariance matrix of a random design, i.e.,
the probability measure $Q$ is the theoretical distribution of observed co-variables in $\mathcal{X}$,
assuming positive definiteness of $\Sigma$ is not very restrictive. We will present some examples
in Section 10.2. Compatibility conditions for the population Gram matrix are of direct
relevance if one replaces $L_2$-loss by robust convex loss (van de Geer 2008). But, as we will
show in the next subsection, even if $\Sigma$ corresponds to the empirical covariance matrix of
a fixed design, i.e., the measure $Q$ is the empirical measure $Q_n$ of $n$ observed co-variables
in $\mathcal{X}$, the compatibility and restricted eigenvalue conditions are often "inherited" from the
population version. Therefore, even for fixed designs (and singular $\Sigma$), the collection of
cases where compatibility or restricted eigenvalue conditions hold is quite large.



10.1 Approximating the Gram matrix 

For two (positive semi-definite) matrices $\Sigma_0$ and $\Sigma_1$, we define the supremum distance

$$d_\infty(\Sigma_1, \Sigma_0) := \max_{j,k} |(\Sigma_1)_{j,k} - (\Sigma_0)_{j,k}|.$$

Generally, perturbing the entries in $\Sigma$ by a small amount may have a large impact on the
eigenvalues of $\Sigma$. This is not true for (adaptive) restricted $\ell_1$-eigenvalues, as is shown in
the next lemma and its corollary.

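Lemma 10.1 Assume

$$d_\infty(\Sigma_1, \Sigma_0) \le \tilde\lambda.$$

Then $\forall\ \beta \in \mathcal{R}(L, S)$,

$$\left| \frac{\|f_\beta\|^2_{\Sigma_1}}{\|f_\beta\|^2_{\Sigma_0}} - 1 \right| \le \frac{(L+1)^2 \tilde\lambda s}{\phi^2_{\rm compatible}(\Sigma_0, L, S)},$$

and similarly, $\forall\ \mathcal{N} \supset S$, $|\mathcal{N}| = N$, and $\forall\ \beta \in \mathcal{R}(L, S, \mathcal{N})$,

$$\left| \frac{\|f_\beta\|^2_{\Sigma_1}}{\|f_\beta\|^2_{\Sigma_0}} - 1 \right| \le \frac{(L+1)^2 \tilde\lambda s}{\phi^2(\Sigma_0, L, S, N)},$$

and $\forall\ \mathcal{N} \supset S$, $|\mathcal{N}| = N$, and $\forall\ \beta \in \mathcal{R}_{\rm adaptive}(L, S, \mathcal{N})$,

$$\left| \frac{\|f_\beta\|^2_{\Sigma_1}}{\|f_\beta\|^2_{\Sigma_0}} - 1 \right| \le \frac{(L+1)^2 \tilde\lambda s}{\phi^2_{\rm adaptive}(\Sigma_0, L, S, N)}.$$

Proof of Lemma 10.1. For all $\beta$,

$$\left| \|f_\beta\|^2_{\Sigma_1} - \|f_\beta\|^2_{\Sigma_0} \right| = |\beta^T \Sigma_1 \beta - \beta^T \Sigma_0 \beta| = |\beta^T (\Sigma_1 - \Sigma_0) \beta| \le \tilde\lambda \|\beta\|_1^2.$$

But if $\beta \in \mathcal{R}(L, S)$, it holds that $\|\beta_{S^c}\|_1 \le L \|\beta_S\|_1$, and hence

$$\|\beta\|_1 \le (L+1) \|\beta_S\|_1 \le (L+1) \|f_\beta\|_{\Sigma_0} \sqrt{s} / \phi_{\rm compatible}(\Sigma_0, L, S).$$

This gives

$$\left| \|f_\beta\|^2_{\Sigma_1} - \|f_\beta\|^2_{\Sigma_0} \right| \le (L+1)^2 \tilde\lambda \|f_\beta\|^2_{\Sigma_0} s / \phi^2_{\rm compatible}(\Sigma_0, L, S).$$

The second result can be shown in the same way, and the third result as well, since for
$\beta \in \mathcal{R}_{\rm adaptive}(L, S, \mathcal{N})$ it holds that $\|\beta_{S^c}\|_1 \le L \sqrt{s} \|\beta_S\|_2$, and hence

$$\|\beta\|_1 \le L \sqrt{s} \|\beta_S\|_2 + \|\beta_S\|_1 \le (L+1) \sqrt{s} \|\beta_S\|_2.$$

□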
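The first step in the proof of Lemma 10.1 is a Hölder-type bound: an entrywise perturbation of size $d_\infty(\Sigma_1, \Sigma_0)$ changes the quadratic form $\beta^T \Sigma \beta$ by at most $d_\infty(\Sigma_1, \Sigma_0) \|\beta\|_1^2$, and on the cone $\|\beta_{S^c}\|_1 \le L \|\beta_S\|_1$ this is at most $d_\infty(\Sigma_1, \Sigma_0) (L+1)^2 \|\beta_S\|_1^2$. The following is a minimal numerical sanity check of that step, not part of the original argument. The dimensions, the random matrices and the names `sigma_0`, `sigma_1`, `worst_ratio` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
p, s, L = 40, 4, 3.0  # hypothetical dimension, sparsity and cone constant

# A positive semi-definite reference matrix and a symmetric entrywise perturbation of it.
a = rng.standard_normal((p, p))
sigma_0 = a @ a.T / p
e = rng.uniform(-0.05, 0.05, size=(p, p))
sigma_1 = sigma_0 + (e + e.T) / 2  # need not be PSD; the bound below does not require it
d_inf = np.max(np.abs(sigma_1 - sigma_0))

worst_ratio = 0.0
for _ in range(1000):
    beta = rng.standard_normal(p)
    l1_s = np.sum(np.abs(beta[:s]))    # ||beta_S||_1 with S = first s indices
    l1_sc = np.sum(np.abs(beta[s:]))   # ||beta_{S^c}||_1
    # Shrink the inactive part so that ||beta_{S^c}||_1 <= L * ||beta_S||_1.
    beta[s:] *= min(1.0, L * l1_s / l1_sc)
    gap = abs(beta @ sigma_1 @ beta - beta @ sigma_0 @ beta)
    bound = d_inf * (L + 1) ** 2 * l1_s ** 2
    worst_ratio = max(worst_ratio, gap / bound)

print(worst_ratio)  # never exceeds 1 on the cone
```

Random search of this kind only confirms that the inequality is not violated; it says nothing about how tight the constant $(L+1)^2$ is.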

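Corollary 10.1 We have

$$\phi_{\rm compatible}(\Sigma_1, L, S) \ge \phi_{\rm compatible}(\Sigma_0, L, S) - (L+1) \sqrt{d_\infty(\Sigma_0, \Sigma_1) s}.$$

Similarly,

$$\phi(\Sigma_1, L, S, N) \ge \phi(\Sigma_0, L, S, N) - (L+1) \sqrt{d_\infty(\Sigma_0, \Sigma_1) s},$$

and the same result holds for the adaptive version.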

Corollary 10.1 shows that if one can find a matrix $\Sigma_0$ with well-behaved smallest eigenvalue,
in a small enough $d_\infty$-neighborhood of $\Sigma_1$, then the restricted eigenvalue condition holds
for $\Sigma_1$. As an example, consider the situation where $\psi_j(x) = x_j$ ($j = 1, \ldots, p$) and where

$$\hat\Sigma := \mathbf{X}^T \mathbf{X} / n = (\hat\sigma_{j,k}),$$

where $\mathbf{X} = (X_{i,j})$ is an $(n \times p)$-matrix whose columns consist of i.i.d. $\mathcal{N}(0, 1)$-distributed
entries (but allowing for dependence between columns). We denote by $\Sigma$ the population
covariance matrix of a row of $\mathbf{X}$. Using a union bound, it is not difficult to show that for
all $t > 0$, and for

$$\tilde\lambda(t) := \sqrt{\frac{4t + 8 \log p}{n}} + \frac{4t + 8 \log p}{n},$$

one has the inequality

$$P\big( d_\infty(\hat\Sigma, \Sigma) \ge \tilde\lambda(t) \big) \le 2 \exp[-t]. \qquad (7)$$

This implies that if the smallest eigenvalue $\Lambda^2_{\min}(\Sigma)$ of $\Sigma$ is bounded away from zero, and
if the sparsity $s$ is of smaller order $o(\sqrt{n / \log p})$, then the restricted eigenvalue condition
holds with constant $\phi(S, N)$ not much smaller than $\Lambda_{\min}(\Sigma)$. The result can be extended
to distributions with Gaussian tails.
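Inequality (7) can be illustrated by simulation. The sketch below is an illustration under stated assumptions, not part of the original derivation: it draws Gaussian designs whose rows have a covariance with unit diagonal, takes $\tilde\lambda(t) = \sqrt{a} + a$ with $a = (4t + 8 \log p)/n$, and records how often $d_\infty(\hat\Sigma, \Sigma)$ reaches this threshold. The covariance $\rho^{|j-k|}$, the sample sizes and all function names are hypothetical choices.

```python
import numpy as np

def lambda_tilde(t, n, p):
    """Threshold sqrt(a) + a with a = (4t + 8 log p) / n, as in display (7)."""
    a = (4.0 * t + 8.0 * np.log(p)) / n
    return np.sqrt(a) + a

def sup_distance(sigma_1, sigma_0):
    """Entrywise supremum distance d_inf between two matrices."""
    return np.max(np.abs(sigma_1 - sigma_0))

def exceedance_rate(n, p, t, rho=0.5, n_rep=200, seed=0):
    """Fraction of simulated designs with d_inf(Sigma_hat, Sigma) >= lambda_tilde(t)."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    sigma = rho ** np.abs(idx[:, None] - idx[None, :])  # unit diagonal (assumed example)
    chol = np.linalg.cholesky(sigma)
    threshold = lambda_tilde(t, n, p)
    hits = 0
    for _ in range(n_rep):
        x = rng.standard_normal((n, p)) @ chol.T  # rows are N(0, sigma)
        sigma_hat = x.T @ x / n
        hits += int(sup_distance(sigma_hat, sigma) >= threshold)
    return hits / n_rep

rate = exceedance_rate(n=200, p=50, t=1.0)
print(rate, 2 * np.exp(-1.0))  # empirical exceedance rate versus the bound 2 exp(-t)
```

The union bound behind (7) is conservative, so the empirical rate would be expected to sit well below $2\exp[-t]$.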



10.2 Some examples 

In the following, our discussion mainly applies for $\Sigma$ being the population covariance matrix.
For $\Sigma$ being the empirical covariance matrix, the assumptions in the discussion below
are unrealistic, but as seen in the previous section, the population properties can have
important implications for the restricted eigenvalues of the empirical covariance matrix.

Example 10.1 Consider the matrix

$$\Sigma := (1 - \rho) I + \rho \iota \iota^T,$$

with $0 < \rho < 1$, and $\iota := (1, \ldots, 1)^T$ a vector of 1's. Then the smallest eigenvalue of $\Sigma$ is
$\Lambda^2_{\min}(\Sigma) = 1 - \rho$, so the $(L, S, N)$-restricted eigenvalue condition holds with $\phi^2(L, S, N) \ge 1 - \rho$.
The uniform $(S, s)$-irrepresentable condition is always met. The largest eigenvalue
of $\Sigma$ is $(1 - \rho) + p\rho$. Hence, the restricted isometry constants $\delta_s$ are defined only for

$$\rho < 1/(s - 1).$$
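The claims of Example 10.1 are easy to check numerically. The sketch below, with hypothetical values of $p$, $s$ and $\rho$, computes the extreme eigenvalues of the equicorrelation matrix and the quantity $\sup_{\|\tau_S\|_\infty \le 1} \|\Sigma_{2,1}(S) \Sigma_{1,1}^{-1}(S) \tau_S\|_\infty$, which equals the largest row-wise $\ell_1$-norm of $\Sigma_{2,1}(S) \Sigma_{1,1}^{-1}(S)$. The closed form $s\rho / (1 - \rho + s\rho)$ used for comparison is a short side calculation, not taken from the text.

```python
import numpy as np

p, s, rho = 30, 5, 0.4  # hypothetical sizes and correlation
sigma = (1 - rho) * np.eye(p) + rho * np.ones((p, p))

eigvals = np.linalg.eigvalsh(sigma)  # eigenvalues in ascending order
lam_min, lam_max = eigvals[0], eigvals[-1]

# Active set S = first s indices. The supremum over ||tau||_inf <= 1 of
# ||Sigma_21 Sigma_11^{-1} tau||_inf is the largest row-wise l1 norm.
sigma_11 = sigma[:s, :s]
sigma_21 = sigma[s:, :s]
m = sigma_21 @ np.linalg.inv(sigma_11)
theta_irr = np.max(np.sum(np.abs(m), axis=1))

print(lam_min, lam_max, theta_irr)
```

Since $s\rho / (1 - \rho + s\rho) < 1$ for every $0 < \rho < 1$, the computed value stays below one, in line with the statement that the uniform irrepresentable condition is always met here.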

Example 10.2 In this example, $\Sigma$ is a Toeplitz matrix, defined as follows. Consider a
positive definite function

$$R(k), \quad k \in \mathbb{Z},$$

which is symmetric ($R(k) = R(-k)$) and sufficiently regular in the following sense. The
corresponding spectral density

$$f_{\rm spec}(\gamma) := \sum_{k=-\infty}^{\infty} R(k) \exp(-ik\gamma) \quad (\gamma \in [-\pi, \pi])$$

is assumed to exist, to be continuous and periodic, and

$$\gamma_0 := \arg\min_{\gamma \in [0, \pi]} f_{\rm spec}(\gamma)$$

is assumed unique, with $f_{\rm spec}(\gamma_0) = M > 0$. Moreover, we suppose that $f_{\rm spec}(\cdot)$ is $(2\alpha)$ times
continuously differentiable at $\gamma_0$, with $f^{(2\alpha)}_{\rm spec}(\gamma_0) > 0$. A Toeplitz matrix is

$$\Sigma = (\sigma_{j,k}), \quad \sigma_{j,k} := R(|j - k|),$$

where $R(\cdot)$ satisfies the conditions described above (in terms of the spectral density). A
special case arises with $\sigma_{j,k} = \rho^{|j-k|}$ for some $0 < \rho < 1$. The smallest eigenvalue $\Lambda^2_{\min}(\Sigma)$
of $\Sigma$ is bounded away from zero where the bound is independent of $p$ (Parter, 1961).
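For the special case $\sigma_{j,k} = \rho^{|j-k|}$ the dimension-free bound can be made explicit. The spectral density is $(1 - \rho^2)/(1 - 2\rho\cos\gamma + \rho^2)$, with minimum $(1 - \rho)/(1 + \rho)$ at $\gamma = \pi$, and the eigenvalues of a symmetric Toeplitz matrix lie between the minimum and maximum of its spectral density. The sketch below checks this for a few dimensions. The value of $\rho$, the dimensions and the helper name `ar1_toeplitz` are illustrative assumptions.

```python
import numpy as np

def ar1_toeplitz(p, rho):
    """Toeplitz matrix with entries rho^{|j-k|} (hypothetical helper)."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

rho = 0.7
spectral_min = (1 - rho) / (1 + rho)  # minimum of the spectral density, attained at gamma = pi

# Smallest eigenvalue for increasing dimension p.
min_eigs = {p: np.linalg.eigvalsh(ar1_toeplitz(p, rho))[0] for p in (10, 100, 500)}
for p, lam in min_eigs.items():
    print(p, lam, spectral_min)
```

The smallest eigenvalue decreases towards the spectral minimum as $p$ grows but does not fall below it, which is the sense in which the bound does not depend on $p$.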

Example 10.3 Consider a matrix $\Sigma$ which is of block structure form:

$$\Sigma = \mathrm{diag}(\Sigma_1, \ldots, \Sigma_k),$$

where the $\Sigma_j$ are $(m \times m)$ covariance matrices ($j = 1, \ldots, k$) (the restriction to having the
same dimension $m$ can be easily dropped) and $km = p$. If the minimal eigenvalues satisfy

$$\min_j \Lambda^2_{\min}(\Sigma_j) \ge \eta^2 > 0,$$

then the minimal eigenvalue of $\Sigma$ is also bounded from below by $\eta^2 > 0$. When $m$ is much
smaller than $p$, it is (much) less restrictive to require that small $m \times m$ covariance matrices $\Sigma_j$ have
well-behaved minimal eigenvalues than to require this of large $p \times p$ matrices.

Example 10.4 This example presents a case where the compatibility condition holds, but
where the uniform irrepresentable constant is very large. We also calculate the adaptive
restricted regression. Let the first $s$ indices $\{1, \ldots, s\}$ be the active set $S$ and suppose that

$$\Sigma := \begin{pmatrix} I & \Sigma_{1,2} \\ \Sigma_{2,1} & \Sigma_{2,2} \end{pmatrix},$$

where $I$ is the $(s \times s)$-identity matrix, and

$$\Sigma_{2,1} := \rho (b_2 b_1^T),$$

with $0 < \rho < 1$, and with $b_1$ an $s$-vector and $b_2$ a $(p - s)$-vector, satisfying $\|b_1\|_2 = \|b_2\|_2 = 1$.
Moreover, $\Sigma_{2,2}$ is some $(p - s) \times (p - s)$-matrix, with $\mathrm{diag}(\Sigma_{2,2}) = I$, and with smallest
eigenvalue $\Lambda^2_{\min}(\Sigma_{2,2})$. One easily verifies that

$$\Lambda^2_{\min}(\Sigma) \ge \Lambda^2_{\min}(\Sigma_{2,2}) - \rho.$$

Moreover, for $b_1 := (1, 1, \ldots, 1)^T / \sqrt{s}$ and $b_2 := (1, 0, \ldots, 0)^T$, and $\rho > 1/\sqrt{s}$, the $(S, s)$-uniform
irrepresentable condition does not hold, as in that case

$$\sup_{\|\tau_S\|_\infty \le 1} \|\Sigma_{2,1}(S) \Sigma_{1,1}^{-1}(S) \tau_S\|_\infty = \rho \sqrt{s}.$$

However, for any $N > s$, the $(S, N)$-uniform irrepresentable condition does hold. We
moreover have

$$\vartheta_{\rm adaptive}(S) = \sqrt{s} \|\Sigma_{1,2}\|_{2,\infty} = \sqrt{s} \rho,$$

i.e. (since $\Lambda(S, s) = 1$), the bounds of Lemma 4.1 and Theorem 5.1 are strict in this
example.
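The value $\rho\sqrt{s}$ in Example 10.4 can be reproduced directly. The sketch below builds $\Sigma$ with the stated $b_1$ and $b_2$ and, as an additional assumption made only for this illustration, takes $\Sigma_{2,2} = I$ (the example allows any $\Sigma_{2,2}$ with unit diagonal). The sizes `s`, `q` and the value of `rho` are hypothetical.

```python
import numpy as np

s, q, rho = 16, 10, 0.5  # q = p - s; here rho > 1/sqrt(s) = 0.25
b1 = np.ones(s) / np.sqrt(s)
b2 = np.zeros(q)
b2[0] = 1.0

sigma_21 = rho * np.outer(b2, b1)
# Assumption for this sketch: Sigma_22 is the identity matrix.
sigma = np.block([[np.eye(s), sigma_21.T], [sigma_21, np.eye(q)]])

# Sigma_11(S) = I, so the irrepresentable quantity is the largest row-wise
# l1 norm of Sigma_21 itself.
m = sigma_21 @ np.linalg.inv(sigma[:s, :s])
theta_irr = np.max(np.sum(np.abs(m), axis=1))

lam_min = np.linalg.eigvalsh(sigma)[0]
print(theta_irr, rho * np.sqrt(s), lam_min)
```

With these values the irrepresentable quantity is 2, so the uniform irrepresentable condition fails, while the smallest eigenvalue of $\Sigma$ stays at $1 - \rho$ and the compatibility condition is unaffected.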

Example 10.5 We recall that $\phi_{\rm compatible}(S) \ge \phi(S, s)$. Here is an example where the
compatibility condition holds with reasonable $\phi_{\rm compatible}(S)$, but where the restricted eigenvalue
$\phi^2(S, s)$ is very small. Assume $s > 2$. Let the first $s$ indices $\{1, \ldots, s\}$ be the active
set $S$ with corresponding $(s \times s)$ covariance matrix $\Sigma_{1,1}$, and suppose that

$$\Sigma := \mathrm{diag}(\Sigma_{1,1}, I),$$

where

$$\Sigma_{1,1} = \mathrm{diag}(B, I),$$

and, for some $0 < \rho < 1 - 1/(s - 2)$,

$$B := \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}.$$

We then have

$$\beta_S^T \Sigma_{1,1} \beta_S = (1 - \rho)(\beta_1^2 + \beta_2^2) + \rho (\beta_1 + \beta_2)^2 + \sum_{j=3}^{s} \beta_j^2$$
$$\ge (1 - \rho)(\beta_1^2 + \beta_2^2) + \Big( \sum_{j=3}^{s} |\beta_j| \Big)^2 / (s - 2).$$

Hence,

$$\min_{\|\beta_S\|_1 = 1} \beta_S^T \Sigma_{1,1} \beta_S \ge \min_{|\beta_1| + |\beta_2| \le 1} \Big\{ (1 - \rho)(\beta_1^2 + \beta_2^2) + (1 - |\beta_1| - |\beta_2|)^2 / (s - 2) \Big\}$$
$$\ge \frac{(s-2)(1-\rho) - 1}{(s-2)\big((s-2)(1-\rho) + 1\big)}.$$

It follows that

$$\phi^2_{\rm compatible}(S) = \min_{\|\beta_S\|_1 = 1} s \beta_S^T \Sigma_{1,1} \beta_S \ge \frac{s \big( (s-2)(1-\rho) - 1 \big)}{(s-2)\big((s-2)(1-\rho) + 1\big)} \ge \frac{(s-2)(1-\rho) - 1}{(s-2)(1-\rho) + 1}.$$

On the other hand

$$\phi^2(S, s) = \Lambda^2(S, s) = (1 - \rho).$$

Hence, for example when $1 - \rho = 3/(s - 2)$, we get

$$\phi^2_{\rm compatible}(S) \ge 1/2$$

and

$$\phi^2(S, s) = 3/(s - 2).$$

Clearly, for large $s$, this means that $\phi_{\rm compatible}(S)$ is much better behaved than $\phi(S, s)$.
Note that large $s$ in this example (with $1 - \rho = 3/(s - 2)$) corresponds to a correlation $\rho$
close to one, i.e., to a case where $\Sigma$ is "almost" singular.



11 Adding noise 

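We now consider the Lasso estimator based on $n$ noisy observations. Let $X_i \in \mathcal{X}$ ($i = 1, \ldots, n$)
be the co-variables, and $Y_i \in \mathbb{R}$ ($i = 1, \ldots, n$) be the response variables. The
noisy Lasso is

$$\hat\beta := \arg\min_\beta \Big\{ \frac{1}{n} \sum_{i=1}^{n} |Y_i - f_\beta(X_i)|^2 + \lambda \|\beta\|_1 \Big\}.$$

The design matrix is

$$\mathbf{X} = \mathbf{X}_{n \times p} := (\psi_j(X_i)).$$

The empirical Gram matrix is

$$\hat\Sigma := \mathbf{X}^T \mathbf{X} / n = \int \psi^T \psi \, dQ_n = (\hat\sigma_{j,k}),$$

where $Q_n$ is the empirical measure $Q_n := \sum_{i=1}^{n} \delta_{X_i} / n$. The $L_2(Q_n)$-norm is denoted by
$\| \cdot \|_n$. We moreover let $(\cdot, \cdot)_n$ be the $L_2(Q_n)$-inner product.

As before, we write $f^0 = f_{\beta^0}$ and now, $\hat f = f_{\hat\beta}$. We consider

$$\epsilon_i := Y_i - f^0(X_i), \quad i = 1, \ldots, n,$$

as the noise. Moreover, we write (with some abuse of notation)

$$(f, \epsilon)_n := \frac{1}{n} \sum_{i=1}^{n} f(X_i) \epsilon_i,$$

and we define

$$\lambda_0 := 2 \max_{1 \le j \le p} |(\psi_j, \epsilon)_n|.$$

Here is a simple example which shows how $\lambda_0$ behaves in the case of i.i.d. standard normal
errors.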

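Lemma 11.1 Suppose that $\epsilon_1, \ldots, \epsilon_n$ are i.i.d. $\mathcal{N}(0, 1)$-distributed, and that $\hat\sigma_{j,j} = 1$ for
all $j$. Then we have for all $t > 0$, and for

$$\lambda_0(t) := 2 \sqrt{\frac{2t + 2 \log p}{n}},$$

$$P\Big( 2 \max_{1 \le j \le p} |(\psi_j, \epsilon)_n| \le \lambda_0(t) \Big) \ge 1 - 2 \exp[-t].$$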
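The statement of Lemma 11.1 can be illustrated by a small Monte Carlo experiment. The sketch below fixes one design with columns normalized so that $\hat\sigma_{j,j} = 1$, repeatedly draws standard normal noise, and records how often $\lambda_0 = 2 \max_j |(\psi_j, \epsilon)_n|$ stays below $\lambda_0(t)$. The Gaussian design, the sizes and the function names are assumptions made for the illustration.

```python
import numpy as np

def lambda_0_threshold(t, n, p):
    """lambda_0(t) = 2 * sqrt((2t + 2 log p) / n), as in Lemma 11.1."""
    return 2.0 * np.sqrt((2.0 * t + 2.0 * np.log(p)) / n)

def coverage(n, p, t, n_rep=500, seed=0):
    """Fraction of noise draws with 2 max_j |(psi_j, eps)_n| <= lambda_0(t)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, p))         # hypothetical fixed design
    x /= np.sqrt(np.mean(x ** 2, axis=0))   # normalize so that sigma_hat_jj = 1
    threshold = lambda_0_threshold(t, n, p)
    ok = 0
    for _ in range(n_rep):
        eps = rng.standard_normal(n)
        lam0 = 2.0 * np.max(np.abs(x.T @ eps)) / n
        ok += int(lam0 <= threshold)
    return ok / n_rep

cov = coverage(n=100, p=200, t=1.0)
print(cov, 1 - 2 * np.exp(-1.0))  # empirical coverage versus the guaranteed lower bound
```

Because the lemma rests on a union bound over $p$ coordinates, the empirical coverage is typically far above $1 - 2\exp[-t]$.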
Proof. As $\hat\sigma_{j,j} = 1$, we know that $V_j := \sqrt{n} (\psi_j, \epsilon)_n$ is $\mathcal{N}(0, 1)$-distributed. So

$$P\Big( \max_{1 \le j \le p} |V_j| > \sqrt{2t + 2 \log p} \Big) \le 2p \exp\Big[ -\frac{2t + 2 \log p}{2} \Big] = 2 \exp[-t].$$

□

11.1 Prediction error in the noisy case



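A noisy counterpart of Lemma 2.1 is:

Lemma 11.2 Take $\lambda > \lambda_0$, and define $L := (\lambda + \lambda_0)/(\lambda - \lambda_0)$. Then

$$\|\hat f - f^0\|_n^2 + \frac{2 \lambda_0}{L - 1} \|\hat\beta_{S^c}\|_1 \le \frac{4 L^2 \lambda_0^2 s}{(L - 1)^2 \phi^2_{\rm compatible}(\hat\Sigma, L, S)}.$$

Proof of Lemma 11.2. Because

$$2 |(\epsilon, \hat f - f^0)_n| \le \Big( 2 \max_{1 \le j \le p} |(\psi_j, \epsilon)_n| \Big) \|\hat\beta - \beta^0\|_1 \le \lambda_0 \|\hat\beta - \beta^0\|_1,$$

we now have the Basic Inequality

$$\|\hat f - f^0\|_n^2 + \lambda \|\hat\beta\|_1 \le \lambda_0 \|\hat\beta - \beta^0\|_1 + \lambda \|\beta^0\|_1.$$

Hence,

$$\|\hat f - f^0\|_n^2 + (\lambda - \lambda_0) \|\hat\beta_{S^c}\|_1 \le (\lambda + \lambda_0) \|\hat\beta_S - \beta^0_S\|_1.$$

Thus,

$$\|\hat\beta_{S^c}\|_1 \le L \|\hat\beta_S - \beta^0_S\|_1.$$

This implies

$$\|\hat\beta_S - \beta^0_S\|_1 \le \sqrt{s} \|\hat f - f^0\|_n / \phi_{\rm compatible}(\hat\Sigma, L, S).$$

So we arrive at

$$\|\hat f - f^0\|_n^2 + (\lambda - \lambda_0) \|\hat\beta_{S^c}\|_1 \le (\lambda + \lambda_0) \sqrt{s} \|\hat f - f^0\|_n / \phi_{\rm compatible}(\hat\Sigma, L, S).$$

Now, insert $\lambda = \lambda_0 (L + 1)/(L - 1)$. □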

In a similar way, but using $(S, 2s)$-restricted eigenvalue conditions, one may prove $\ell_2$-convergence
in the noisy case.

Observe that the $S$-compatibility condition now involves the matrix $\hat\Sigma$, which is definitely
singular when $p > n$. However, we have seen in the previous section that, also for such
$\hat\Sigma$, compatibility conditions and restricted eigenvalue conditions hold in fairly general
situations.
11.2 Noisy KKT

The KKT conditions in the noisy case become

$$2 (\psi_j, \hat f - f^0)_n - 2 (\psi_j, \epsilon)_n = -\lambda \hat\tau_j, \quad j = 1, \ldots, p,$$

or in matrix notation,

$$2 \hat\Sigma (\hat\beta - \beta^0) - 2 \mathbf{X}^T \epsilon / n = -\lambda \hat\tau,$$

where $\|\hat\tau\|_\infty \le 1$, and $\hat\tau_j := \mathrm{sign}(\hat\beta_j)$ whenever $\hat\beta_j \ne 0$.
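These conditions can be verified numerically for a computed Lasso solution. The sketch below is a plain coordinate descent for the objective $\frac{1}{n} \sum_i |Y_i - f_\beta(X_i)|^2 + \lambda \|\beta\|_1$ with $\psi_j(x) = x_j$; it is one standard way to compute the estimator and is not an algorithm taken from the text. At a solution, $2 \mathbf{X}^T (Y - \mathbf{X} \hat\beta) / n = \lambda \hat\tau$, which is the matrix form above since $Y = \mathbf{X} \beta^0 + \epsilon$. The simulated data, the value of $\lambda$ and the function names are assumptions.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Scalar soft-thresholding operator."""
    return np.sign(z) * max(abs(z) - gamma, 0.0)

def lasso_cd(x, y, lam, n_sweeps=500):
    """Coordinate descent for (1/n) * ||y - x beta||_2^2 + lam * ||beta||_1."""
    n, p = x.shape
    beta = np.zeros(p)
    resid = y.copy()
    col_sq = np.sum(x ** 2, axis=0) / n  # sigma_hat_jj
    for _ in range(n_sweeps):
        for j in range(p):
            resid += x[:, j] * beta[j]   # partial residual without coordinate j
            z = x[:, j] @ resid / n
            beta[j] = soft_threshold(z, lam / 2.0) / col_sq[j]
            resid -= x[:, j] * beta[j]
    return beta

rng = np.random.default_rng(0)
n, p, s = 200, 20, 3                     # hypothetical sizes
x = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:s] = 2.0
y = x @ beta0 + rng.standard_normal(n)

lam = 0.5
beta_hat = lasso_cd(x, y, lam)

grad = 2.0 * x.T @ (y - x @ beta_hat) / n  # should equal lam * tau_hat
active = beta_hat != 0
kkt_gap_all = np.max(np.abs(grad)) - lam   # <= 0 up to rounding: ||tau_hat||_inf <= 1
kkt_gap_active = np.max(np.abs(grad[active] - lam * np.sign(beta_hat[active])))
print(kkt_gap_all, kkt_gap_active)
```

The design here has $n > p$ so that coordinate descent converges quickly; for $p > n$ the same check applies but more sweeps may be needed.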

To avoid too many repetitions, let us only formulate the noisy version of a part of Part 1
of Lemma 6.2.



Lemma 11.3 Take $\lambda > \lambda_0$, and define $L := (\lambda + \lambda_0)/(\lambda - \lambda_0)$. Suppose the uniform
$(\hat\Sigma, L, S, s)$-irrepresentable condition holds. Then $\hat S \subset S$.

Proof of Lemma 11.3. This follows from a straightforward generalization of Lemma 6.1,
where the equalities now become inequalities:



II 2 
211(7^)^11^ < z ^ i X E 2)1 (S)E^ 1 (S)ts - ^-Aoll/Mli- 



Here, f s is the anti-projection of /, in L2(Q n ), on the space spanned by {ipj}j e s- 

□



The noisy KKT conditions involve the matrix $\hat\Sigma$. Again, as discussed in Subsection 10.1,
we may replace it by an approximation. As a consequence, if this approximation is good
enough, we can replace $(\hat\Sigma, L, S, s)$-irrepresentable conditions by $(\Sigma, \tilde L, S, s)$-irrepresentable
conditions, provided we take $\tilde L > L$ large enough.

Lemma 11.4 Take $\lambda > \lambda_0$, and define $L := (\lambda + \lambda_0)/(\lambda - \lambda_0)$. Suppose that

$$d_\infty(\hat\Sigma, \Sigma) \le \tilde\lambda,$$

and 



and in fact, that 



Then 



^compatible S) > (L + l)V\s 

(L + l)VTs 



< 1. 



(^compatible (S, L, S) — (L + 1) V As 



2 A 

(S-S)(/3-/3 )|| oo < 



L — 1 

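Proof of Lemma 11.4. We have

$$\|(\hat\Sigma - \Sigma)(\hat\beta - \beta^0)\|_\infty \le \tilde\lambda \|\hat\beta - \beta^0\|_1 \le (L + 1) \tilde\lambda \|\hat\beta_S - \beta^0_S\|_1$$

$$\le (L + 1) \tilde\lambda \sqrt{s} \|\hat f - f^0\|_n / \phi_{\rm compatible}(\hat\Sigma, L, S)$$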
2A (L + l) 2 As
< 2A (L + 1) 2 AsAq 

- 1) ^compatible (S, L, 5) - (L + 1)\/Is^ 

□ 

We conclude that the KKT conditions in the noisy case can be exploited in the same way 
as in the case without noise, albeit that one needs to adjust the constants (making the 
conditions more restrictive). 



12 Discussion 



We show how various conditions for Lasso oracle results relate to each other, as illustrated
in Figure 1. Thereby, we also introduce the restricted regression condition.

For deriving oracle results for prediction and estimation, the compatibility condition is the
weakest. Looking at the derivation of the oracle result in Lemma 2.1, no substantial room
seems to be left to improve the condition. The restricted eigenvalue condition is slightly
stronger but in some cases, as demonstrated in Example 10.5, the compatibility condition
is a real improvement.

For variable selection with the Lasso, the irrepresentable condition is sufficient (assuming 
sufficiently large non-zero regression coefficients) and essentially necessary. We present the, 
perhaps not unexpected, but as yet not formally shown, result that the irrepresentable 
condition is always stronger than the compatibility condition. 



We illustrate in Section 10 how - in theory - one can verify the compatibility condition. If
the sparsity is of small order $o(\sqrt{n / \log p})$, we can approximate the empirical Gram matrix
by the population analogue. It is then much easier and more realistic to assume that the population
Gram matrix has sufficiently regular behavior, as illustrated with our examples in Section
10.2. We believe moreover that a sparsity bound of small order $o(\sqrt{n / \log p})$ covers a
large area of interesting statistical problems. With larger $s$, the statistical situation is
comparable to one of a nonparametric model with "(effective) smoothness less than 1/2",
leading to very slow convergence rates. In contrast, for example in decoding problems,
sparseness up to the linear-in-$n$ regime can be very important. Moreover, in the case of
robust convex loss, one may apply the compatibility condition directly to the population
matrix, i.e., the sparsity regime $s = o(\sqrt{n / \log p})$ can be relaxed for such loss functions
(see van de Geer (2008)). We therefore conclude that oracle results for the Lasso hold
under quite general design conditions.

A final remark is that in our formulation, the compatibility condition and restricted eigenvalue
condition depend on the sparsity $s$ as well as on the active set $S$. As $S$ is unknown,
this means that for a practical guarantee, the conditions should hold for all $S$. Moreover,
one then needs to assume the sparsity $s$ to be known, or at least a good upper bound
needs to be given. Such strong requirements are the price for practical verifiability. We
however believe that in statistical modeling, non-verifiable conditions are allowed and in
fact common practice. Moreover, our model assumes a sparse linear "truth" with "true"
active set $S$, only for simplicity. Without such assumptions, there is no "true" $S$, and the
oracle inequality concerns a trade-off between sparse approximation and estimation error,
see for example van de Geer (2008).